Use Your Robots.txt To Publish Your Sitemaps Xml File
#1
Posted 11 April 2007 - 11:07 AM
Here's some proof:
http://googlewebmast...itemapsorg.html
http://www.sitemaps....l#submit_robots
Does anybody see any limitations with this technique? It also doesn't take into account the fact that you still have to ping the search engines to tell them you've updated your Sitemaps URL listing.
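For anyone following along, the technique boils down to one extra line in robots.txt of the form `Sitemap: <full URL>`. Here is a minimal Python sketch of how a crawler might pick that line up. The function name `extract_sitemaps` and the example.com URL are made up for illustration; this is not how any particular engine is known to parse the file.

```python
from typing import List


def extract_sitemaps(robots_txt: str) -> List[str]:
    """Return the URL from every 'Sitemap:' line in a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Split on the first colon only, so the colon in "http://" survives.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt body, used only for illustration.
EXAMPLE = """\
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
"""

print(extract_sitemaps(EXAMPLE))
```

The field name is matched case-insensitively here as a lenient choice; a stricter parser could reject anything other than `Sitemap`.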
#2
Posted 11 April 2007 - 11:55 AM
I too found that most interesting and exciting news. I would think the pinging is now redundant, unless the spiders visit very rarely. If you check traffic logs, you will find that spiders will sometimes read only the robots.txt file even if they're allowed to read more. I think it's the most rapid way they can check that the website is still alive.
#3
Posted 11 April 2007 - 12:27 PM
Welcome to the forums, AtlasOrient :wave: - tell us more about yourself
John
#5
Posted 11 April 2007 - 01:50 PM
Hi folks, thank you kindly for the warm welcome. I'm in the SEO/technology outsourcing business. I look forward to getting to know more of you folks here!
Edited by AtlasOrient, 17 April 2007 - 09:44 AM.
#7
Posted 11 April 2007 - 03:42 PM
Come to think of it, robots.txt is a bit of a hack anyway: outdated and ambiguous user-agent names, "Disallow: " meaning "disallow nothing", some engines accepting wildcards but not all of them, etc. Not to mention the unknown consequences of listing URLs there: are they removed from the search results or just not re-crawled? Are the old listings phased out or just not updated? What happens when the server is down or busy while the engine is requesting the robots.txt? What happens when servers return an error page with code 200 for robots.txt? It's just not that clear.
Wouldn't it be nice to be able to throw that all away and start over with a clean robots-control standard? Sigh.
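The "Disallow: " quirk is easy to check with Python's standard-library parser. This is only a sketch of how `urllib.robotparser` reads the two forms; real search engines may differ, and the helper name `is_allowed`, the agent name and the example.com URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse a robots.txt body and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# An empty Disallow value blocks nothing; a lone slash blocks everything.
EMPTY_DISALLOW = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

url = "http://www.example.com/page.html"
print(is_allowed(EMPTY_DISALLOW, "anybot", url))  # True
print(is_allowed(BLOCK_ALL, "anybot", url))       # False
```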
Something else that I was wondering about: since the sitemap must be specified with the full URL, is this automatically a way of choosing the preferred domain and protocol? Assuming "domain.com" serves the same robots.txt file as "www.domain.com", wouldn't the domain listed in the linked sitemap file automatically be given more value? Could you also redirect misled crawlers who happened to prefer the https version of a site (over the http version)? That would be really neat. But will the value of the canonicals automatically be transferred to the chosen one?
John
#8
Posted 11 April 2007 - 06:59 PM
From the HTML4 Specs
Although LINK has no content, it conveys relationship information that may be rendered by user agents in a variety of ways (e.g., a tool-bar with a drop-down menu of links).
Authors may use the LINK element to provide a variety of information to search engines, including:
- Links to alternate versions of a document, written in another human language.
- Links to alternate versions of a document, designed for different media, for instance a version especially suited for printing.
- Links to the starting page of a collection of documents.
*my bolding
Edited by Adrian, 11 April 2007 - 07:00 PM.
#10
Posted 12 April 2007 - 08:01 AM
#11
Posted 12 April 2007 - 10:59 AM
However, should we stay with the kludgy syntax of Robots.txt? Should a whole new, parallel standard be adopted as Robots.txt is phased out? Maybe we should lobby for a Spiders.txt?
#13
Posted 16 April 2007 - 12:22 AM
So today, the 15th, I made the HTTP request to inform them of the sitemap.xml. Google, Yahoo, and Ask all gave me a 404. I tried the sitemap URL both URL-encoded and unencoded.
Used: <searchengine_URL>/ping?sitemap=sitemap_url as per sitemaps.org
#14
Posted 16 April 2007 - 02:08 AM
Here are the Ping addresses (remove space before using):
Ask: http://submissions.ask. com/ping?sitemap=http%3A//www.domain.com/sitemap.xml
Google: http://www.google. com/webmasters/sitemaps/ping?sitemap=http:%3A//www.domain.com/sitemap.xml
Yahoo: http://search.yahooapis. com/SiteExplorerService/V1/updateNotification?appid=YahooDemo&url=http://www.domain.com/sitemap.xml
Regardless, I would strongly suggest using the Google Webmaster Tools to submit the sitemap file to Google -- if you have any errors in the file, you will only be notified there.
John
Edited by softplus, 16 April 2007 - 02:09 AM.
#15
Posted 16 April 2007 - 10:48 AM
And for anyone following, I presume that the Google one should have either the colon or %3A, not both:
http:%3A//
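One way to sidestep the colon question is to let a library percent-encode the whole sitemap URL before appending it as the query value. A minimal Python sketch follows; `build_ping_url` is a made-up helper, the endpoint string is simply copied from the post above, and nothing here sends a request or checks that the endpoint accepts it.

```python
from urllib.parse import quote


def build_ping_url(ping_endpoint: str, sitemap_url: str, param: str = "sitemap") -> str:
    """Append the sitemap URL to the ping endpoint as a percent-encoded query value."""
    # safe="" encodes ':' and '/' as well, so the value cannot be
    # confused with the structure of the outer URL.
    return f"{ping_endpoint}?{param}={quote(sitemap_url, safe='')}"


# Endpoint as quoted earlier in this thread; treat it as an assumption.
google_ping = build_ping_url(
    "http://www.google.com/webmasters/sitemaps/ping",
    "http://www.domain.com/sitemap.xml",
)
print(google_ping)
# http://www.google.com/webmasters/sitemaps/ping?sitemap=http%3A%2F%2Fwww.domain.com%2Fsitemap.xml
```

The `param` argument is there because the Yahoo address quoted above uses `url=` rather than `sitemap=`.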
#16
Posted 16 April 2007 - 10:59 AM
However, should we stay with the kludgy syntax of Robots.txt? Should a whole new, parallel standard be adopted as Robots.txt is phased out? Maybe we should lobby for a Spiders.txt?
Full support for regular expressions would be a FANTASTIC new feature for a major overhaul of robots.txt, in my opinion. It's very limiting as is...
#17
Posted 16 April 2007 - 12:09 PM
<searchengine_URL>/ping?sitemap=sitemap_url seems pretty simple to me. Keep the complex stuff in the sitemap file. OK. spiders.txt
#18
Posted 18 April 2007 - 03:48 PM
- The sitemaps line is independent of the user-agent. However, you can trick the search engine into using a desired sitemap file by listing multiple sitemap files and disallowing all but one (per user-agent), e.g.:
(note: the MSNBot part is just a theory - MSN doesn't pick up sitemap files yet)
User-agent: googlebot
Disallow: /sitemap2.xml
User-agent: MSNBot
Disallow: /sitemap1.xml
Sitemap: http://www.example.com/sitemap1.xml
Sitemap: http://www.example.com/sitemap2.xml
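To see what that robots.txt would mean to a parser that does apply Disallow rules to the listed sitemap files, here is a rough Python sketch using the standard library. It only shows how `urllib.robotparser` reads the file; whether any real engine filters sitemaps this way is exactly the untested part of the trick. `sitemaps_for` is a made-up helper, and the blank lines between groups are added for readability.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /sitemap2.xml

User-agent: MSNBot
Disallow: /sitemap1.xml

Sitemap: http://www.example.com/sitemap1.xml
Sitemap: http://www.example.com/sitemap2.xml
"""


def sitemaps_for(agent, robots_txt):
    """List the Sitemap URLs that `agent` is not disallowed from fetching."""
    lines = robots_txt.splitlines()
    parser = RobotFileParser()
    parser.parse(lines)
    # Collect every Sitemap line, regardless of which group it follows.
    listed = [
        line.split(":", 1)[1].strip()
        for line in lines
        if line.lower().startswith("sitemap:")
    ]
    return [url for url in listed if parser.can_fetch(agent, url)]


print(sitemaps_for("googlebot", ROBOTS_TXT))  # only sitemap1.xml
print(sitemaps_for("MSNBot", ROBOTS_TXT))     # only sitemap2.xml
```

An agent with no matching group would get both files back, since nothing is disallowed for it.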
- If your server serves the same content for canonical domain versions (www/non-www) then a robots.txt with a sitemap file for a given version will not change much; it's most likely going to be interpreted differently per search engine. Use a 301 to clean it up, if you can.
John
#19
Posted 19 April 2007 - 06:54 AM
Regardless, I would strongly suggest using the Google Webmaster Tools to submit the sitemap file to Google -- if you have any errors in the file, you will only be notified there.
That's pretty much what Vanessa Fox said during her recent discussion with Rand Fishkin (aka Randfish), i.e. that it is better to test via Webmaster Tools before utilising the Sitemaps.org way of doing things.
#20
Posted 21 April 2007 - 02:36 AM
http://search.yahooapis.com/ SiteExplorerService/V1/ping?url=http%3A%2F%2Fwww.domain.com%2Fsitemap.xml
(see http://developer.yah...er/V1/ping.html )
Thanks, Maile!
John