
Cre8asiteforums Internet Marketing
and Conversion Web Design



Your Robots.txt File Is Searchable By People


12 replies to this topic

#1 cre8pc

cre8pc

    Dream Catcher Forums Founder

  • Admin - Top Level
  • 13365 posts

Posted 13 October 2007 - 09:34 PM

We all use robots.txt on our servers to block search engines and bots from accessing files and folders.

Did you know that anyone can run a search to see what you're blocking?
Did you know that blocking search engines from crawling files and folders doesn't mean those files are blocked from people?

Type into the Google search field:

"robots.txt" "disallow:" filetype:txt


You'll see the robots.txt files for Google itself, government sites, Webmasterworld, Craigslist, Microsoft, and much more.

Interestingly, it showed 72,900 results, which seems low.

The kicker is that if you take a disallowed path and append it to the site's URL, you can sometimes get right into the file they think is being blocked. (In some cases, not all.)

Ex:

http://www.craigslist.org/robots.txt

Remove the "robots.txt" from the URL and enter one of the disallowed paths instead, like the one from "Disallow: /sss".

It would look like

http://www.craigslist.org/sss/


Your browser will take you right to the page.

Many sites don't show anything, which is good for them. It's interesting, however, to look and see what they block. Some government sites are working hard to block email harvesters and known invasive bots. Some companies specify in detail the search engine bots they don't want coming around.

I went to Google and searched on one of my directory folders that I don't want crawled and indexed. I discovered that I can go right in there, see a list of the files in that folder, click on them and get them. I can also see the robots.txt file in the folder.

How do you protect files and folders from being accessed in a browser by someone who has the exact URL?

I found this interesting because there's a misconception that the robots.txt file is "hiding" things you don't want found on your server. I thought this was a good reminder of what exactly this means and doesn't mean.

If you don't want nudie pictures of you being found, hide them in the attic. :pieinface:

#2 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 14 October 2007 - 05:32 AM

Good topic!

The key concept of robots.txt is that it's advisory, not mandatory. Most bots in the world simply ignore robots.txt because they just want to scrape your content and go away. Members of the Nice Bots League actually consider the advice of the file and follow it (most of the time - I have evidence that MSNBot sometimes doesn't obey the robots.txt file).

So to elaborate: there is a difference between advising of lack of availability, hiding, and locking content. The first option is robots.txt saying "hey, don't go there". The second is putting some content on a hard-to-guess URL that only you and others you share it with will know (a-la Flickr's way of sharing locked albums). The third one is putting up strong password protection around the content.
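That advisory nature is easy to demonstrate with Python's standard library: urllib.robotparser tells a polite crawler what the file asks for, but nothing enforces the rule. (The robots.txt body and URLs below are made-up examples, modelled on the Craigslist case above.)

```python
from urllib import robotparser

# Hypothetical robots.txt body, like the one Craigslist serves.
robots_txt = """\
User-agent: *
Disallow: /sss
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A member of the Nice Bots League asks before fetching...
print(rp.can_fetch("NiceBot", "http://www.example.org/sss/page.html"))  # False
print(rp.can_fetch("NiceBot", "http://www.example.org/about.html"))     # True

# ...but a scraper is free to request /sss/page.html anyway; the file
# only advises, it cannot block anything.
```

Notice that the "blocking" happens entirely on the client side: it is the bot that decides to honor the answer, not the server.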

Of course, if you don't want "nudie pictures of you being found" don't take pictures of you nude. That's the only guarantee!

Pierre

#3 Wit

Wit

    Sonic Boom Member

  • 1000 Post Club
  • 1599 posts

Posted 14 October 2007 - 06:53 AM

Ouch. Tx for the tip eKs :)

Of course it is good practice to only show the first couple of letters of the "disallowed" files and folders in your robots.txt file to keep people and bad bots guessing.

Sorta like:
User-agent: *
Disallow: /zz


But yeah that only works if you have the freedom to choose your own file and folder names :P

#4 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 14 October 2007 - 07:08 AM

Anything you put online, without proper password-protection, is likely to be found. If you're unlucky, it'll even end up in a search engine. The robots.txt is never a replacement for password-protection. Like Pierre said, if you don't want nude photos of yourself showing up online, don't take any :).

Imagine the following situation (not made up, I've seen it happen all too often):
- your server has log files / statistics that are publicly visible (because your hoster doesn't lock it down.........)
- you access your private files
- those accesses are logged and shown in your statistics
- a search engine happens to find the statistics (there are lots of them in the search engine results, thanks to those hosters.......)
- your robots.txt accidentally gets removed (oops, who would notice...)

It happens every day. Generally, the search engines will revert back to a last-known version of your robots.txt, but if you keep it removed for long enough, they'll assume that they can crawl everything.

With password-protection in place, your statistics would never be found (those hosters could of course do it automatically, but that would make things hard and cause lots of complaints.....) and if your files were password protected then even with public statistics those files would never be found.

Apache servers make it really easy to add password protection - see http://www.ilovejack...-with-htaccess/
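For anyone who hasn't set it up before, the basic pattern is only a few lines. (The file paths, username, and realm name here are examples; adjust them for your own server.)

```apache
# Create the password file once, stored OUTSIDE the web root:
#   htpasswd -c /home/user/.htpasswd myusername

# .htaccess in the directory you want to protect:
AuthType Basic
AuthName "Private Area"
AuthUserFile /home/user/.htpasswd
Require valid-user
```

Unlike robots.txt, this is enforced by the server itself: every request to that directory gets a 401 until valid credentials are supplied.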

John

#5 Guest_Autocrat_*

Guest_Autocrat_*
  • Guests

Posted 14 October 2007 - 07:19 AM

Darn... glad I refreshed - as more posts crept in...

Okay... simple method of work... if you have access to your .htaccess file.

Whatever files you disallow in your robots.txt file, you can also block in your .htaccess file.

This means that even if someone can access your robots.txt file and see what you are trying to "hide", they still shouldn't be able to access it.


Further, whilst using .htaccess, you can include an option to prevent directories that do not have an index file from being listed (no index.htm/l, no looky!).

The alternative to that is to include index.html files in every directory, so even if people do get in, they cannot browse the contents, since the server automatically serves the index file.


I would strongly recommend looking up .htaccess if you want security, as robots.txt is only "optional"... it doesn't actually do anything to control bots, and does even less to control people.
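As a sketch of what that looks like in practice (Apache 2.2-style syntax; the directory name is an example, echoing the /zz trick above):

```apache
# .htaccess in the web root: stop Apache listing the contents of
# directories that have no index file
Options -Indexes

# .htaccess placed inside the "hidden" directory itself (e.g. /zz/):
# refuse every direct request
Order deny,allow
Deny from all
```

A blanket "Deny from all" suits directories that scripts read server-side but that no browser should ever fetch; for an admin area you would combine it with Allow rules or password protection instead.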


NOTE: Please be careful: test a small location first, and ensure you do not lock yourself out. Once you are satisfied that you are comfortable and have achieved the results you are after, make a full backup and then roll it out fully.
Also, make sure that you do not lock down legitimate browser access.

Edited by Autocrat, 14 October 2007 - 07:22 AM.


#6 Mano70

Mano70

    Mach 1 Member

  • Members
  • 256 posts

Posted 14 October 2007 - 10:43 AM

I have evidence that MSNBot sometimes doesn't obey the robots.txt file

It doesn't care about noindex either.

#7 AbleReach

AbleReach

    Peacekeeper Administrator

  • Admin - Top Level
  • 6457 posts

Posted 14 October 2007 - 03:04 PM

User-agent: *
Disallow: /zz

Nice tip. I didn't know that. I've used disallow for /admin/ type directories, but /zz would be a little better.

I'm curious. For an added layer, without much added work, is this a worthwhile idea?

If I have an admin log-in dialog that I'd like to shield somewhat from hackers, what if I named the login URL page anything but index, and then did a 301 from an /admindirectory/index.zzz type file to my 401 page?

Does that make any sense at all?

#8 Pittbug

Pittbug

    Ready To Fly Member

  • Members
  • 46 posts

Posted 16 October 2007 - 10:08 PM

Doing a redirect is not secure either. If your browser can follow the redirect, so can anyone else's.

Better security options are password protection and/or allow by IP address, using the .htaccess file.

If you also access the login page from a wifi network, I would also consider investing in an SSL certificate and accessing the login page over HTTPS.
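An allow-by-IP rule in .htaccess looks like this in Apache 2.2-style syntax (the address is a documentation example; substitute your own static IP):

```apache
# .htaccess in the admin directory: refuse everyone except one IP
Order deny,allow
Deny from all
Allow from 203.0.113.5
```

This only works well if your IP address is static, so treat it as a second layer alongside password protection rather than relying on it alone.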

#9 bobbb

bobbb

    Sonic Boom Member

  • Hall Of Fame
  • 1932 posts

Posted 16 October 2007 - 11:53 PM

If you just rename index.php to some nonsense name.php, and your directory does not show an index listing, isn't it much the same? Someone has to guess the name, and plain /directory/ will give a 403.

Like caPouchX.php. Who is going to guess that?

You need to remember to change any reference to index.* to whatever you named it. Double protection with .htaccess password.

Another idea for admin-type scripts when using open source is to change the admin directory name as above (it is usually admin), in case there is a back door.

Edited by bobbb, 17 October 2007 - 12:05 AM.


#10 wiser3

wiser3

    Gravity Master Member

  • Members
  • 225 posts

Posted 17 October 2007 - 07:38 AM

For all directories where I don't have content in the index.html file, I use an index.html that does a meta-refresh to my site's home page.

#11 Pittbug

Pittbug

    Ready To Fly Member

  • Members
  • 46 posts

Posted 17 October 2007 - 03:09 PM

Someone once told me: security through obscurity is just an illusion.

If you want to secure something, block it properly, rather than just hide it a bit.

#12 AbleReach

AbleReach

    Peacekeeper Administrator

  • Admin - Top Level
  • 6457 posts

Posted 17 October 2007 - 04:07 PM

I've thought of the "disallow" that can be done with robots.txt as a privacy thing more on the line of politeness. Using disallow to hide something truly sensitive is about like the bank hiding front door keys under the welcome mat.

I think this is a good idea, partly because it would help to keep temporarily available content from cluttering up SE results, but mainly because 401's look unprofessional:
Disallow: /test-layouts

Because it draws attention to the existence of /login to anyone who looks at robots.txt, is this probably not a good idea?
Disallow: /login

Can you tell that I dislike software whose standard install creates links to help files and whatnot on a default /login dialog page? It's not a major sin and won't create anything unique that competes, but not adding indexable content of its own to a host's site is an appreciated politeness.

Edited by AbleReach, 17 October 2007 - 04:09 PM.


#13 bobbb

bobbb

    Sonic Boom Member

  • Hall Of Fame
  • 1932 posts

Posted 17 October 2007 - 06:30 PM

rather than just hide it a bit

Hiding it was an argument for the disallow in robots.txt only, not a substitute for directory passwords and login passwords. It gives the same results without giving the name away.


