Did you know that anyone can run a search to see what you're blocking?
Did you know that blocking search engines from crawling files and folders doesn't mean those files and folders are blocked from people?
Type into the Google search field:
"robots.txt" "disallow:" filetype:txt
You'll see the robots.txt files for Google itself, government sites, Webmasterworld, Craigslist, Microsoft, and many more.
Interestingly, it showed 72,900 results, which seems low.
The kicker is that if you take a disallowed path and add it to the site's URL, you can sometimes get into the file the owner thinks is being blocked. (In some cases, not all.)
Ex:
http://www.craigslist.org/robots.txt
Remove "robots.txt" from the end of the URL and put one of the disallowed paths in its place, such as "/sss" from the line "Disallow: /sss".
It would look like
Google will take you to the page.
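If you'd rather not do that by hand, here's a rough Python sketch of the same idea. It only parses a robots.txt body you already have and builds the matching URLs; it doesn't fetch anything. The function name disallowed_urls and the "/private/" line in the sample are made up for illustration. Only the Craigslist URL and "/sss" come from the example above.

```python
from urllib.parse import urljoin


def disallowed_urls(robots_url, robots_text):
    """Return the full URL for every non-empty Disallow path in a robots.txt body."""
    urls = []
    for line in robots_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        field, _, value = line.partition(":")
        if field.strip().lower() != "disallow":
            continue
        path = value.strip()
        if not path:
            continue  # an empty Disallow line blocks nothing
        # Swap the "robots.txt" part of the URL for the disallowed path.
        urls.append(urljoin(robots_url, path))
    return urls


# Sample body: "/sss" is from the post, "/private/" is a hypothetical entry.
sample = """User-agent: *
Disallow: /sss
Disallow: /private/  # hypothetical
"""

for url in disallowed_urls("http://www.craigslist.org/robots.txt", sample):
    print(url)
```

Paths that use wildcards (like "/*.pdf$") are patterns rather than real addresses, so this sketch would hand them back as-is and they wouldn't be directly visitable.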
Many sites don't show anything, which is good for them. It's interesting, however, to look and see what they block. Some government sites are working hard to block email harvesters and known invasive bots. Some companies specify in detail the search engine bots they don't want coming around.
I went to Google and searched on one of my directory folders that I don't want crawled and indexed. I discovered that I can go right in there, see a list of the files in that folder, click on them and get them. I can also see the robots.txt file in the folder.
How do you protect files and folders from being opened in a browser by someone who has the exact URL?
I found this interesting because there's a misconception that the robots.txt file is "hiding" things you don't want found on your server. I thought this was a good reminder of what exactly this means and doesn't mean.
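To make the point concrete, here's a small sketch using Python's standard urllib.robotparser module. The rules, the example.com URLs, and the helper name is_crawl_allowed are all hypothetical. What it shows is that a robots.txt rule is just an answer a polite crawler looks up before it fetches; nothing in it stops a browser, or a rude bot, from requesting the URL anyway.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, not taken from any real site.
RULES = """User-agent: *
Disallow: /private/
"""


def is_crawl_allowed(rules_text, url, agent="*"):
    """Answer the question a well-behaved crawler asks before fetching a URL."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(agent, url)


# A polite bot would skip the first URL and crawl the second.
print(is_crawl_allowed(RULES, "http://example.com/private/report.txt"))
print(is_crawl_allowed(RULES, "http://example.com/public.html"))
```

The check runs entirely on the crawler's side. The server never sees it, which is why a disallowed file can still open fine for anyone who types in the address.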
If you don't want nudie pictures of you being found, hide them in the attic. :pieinface:






