
Cre8asiteforums

Web Site Design, Usability, SEO & Marketing Discussion and Support

AtlasOrient

Use Your Robots.txt To Publish Your Sitemaps Xml File


News flash -- as of today, you no longer need to manually send an HTTP request to each search engine to tell it where your Sitemaps protocol URL listing lives. You simply add the address of the XML file on a new line in the old and faithful robots.txt, and voila! Google, Yahoo, Live, and Ask can all discover your Sitemaps file from there.

 

Here's some proof:

 

http://googlewebmastercentral.blogspot.com...itemapsorg.html

http://www.sitemaps.org/protocol.html#submit_robots
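
 

The syntax is a single extra line anywhere in the file (the domain and file name below are just placeholders):

Sitemap: http://www.example.com/sitemap.xml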

 

Does anybody see any limitations with this technique? It also doesn't address the fact that you still have to ping the search engines to tell them you've updated your Sitemaps URL listing.


Welcome to the Forums, AtlasOrient. :wave:

 

I too found that news most interesting and exciting. I would think the pinging is now redundant, unless the spiders visit very rarely. If you check your traffic logs, you will find that spiders sometimes read only the robots.txt file even when they're allowed to read more; I think it's the quickest way for them to check that the website is still alive.


I've been hoping for this since the start of Sitemaps - I love it! Now all we need is a global way to get pages out of the index quickly (and of course wildcard support as part of the robots.txt standard). SES isn't over - what more will they have in store for us?

 

Welcome to the forums, AtlasOrient :wave: - tell us more about yourself :blink:

 

John


I think it's a great idea: robots.txt has now gained a new feature to control bots. Even the syntax is easy!

 

Pierre


What do you guys think about the fact that the Robots Exclusion Standard is being used for the explicit inclusion of URLs? Admittedly, this is abusing the purpose of that standard file (see http://robotstxt.org for the details). Should there be a different file, with a different standard, for this? I think it's convenient to use robots.txt, but I can understand arguments against pressing this file into service for any old URL purpose.

 

Hi folks, thank you kindly for the warm welcome. I'm in the SEO/technology outsourcing business. I look forward to getting to know more of you folks here!

Edited by AtlasOrient


I heard this news at the NYC SES a little while ago. I don't have any further details, but perhaps it was picked up in the SES coverage and discussed in more depth somewhere.


I'm not that happy about it being in the robots.txt - I feel a meta-tag on the root page would have been more logical... it seems a bit like a hack, but then again, how quickly could you get webmasters to create and use yet another "standard" file? :)

 

Come to think of it, robots.txt is a bit of a hack anyway: outdated and ambiguous user-agent names, "Disallow: " meaning "disallow nothing", some engines accepting wildcards but not all of them, etc. Not to mention the unknown consequences of listing URLs there: Are they removed from the search results or just not re-crawled? Are the old listings phased out or just not updated? What happens when the server is down or busy while the engine is requesting the robots.txt? What happens when a server returns an error page with code 200 for robots.txt? It's just not that clear.
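
 

To illustrate the first ambiguity, these two records look almost identical but mean opposite things (just a sketch, not from any real site):

User-agent: *
Disallow:

allows everything (an empty value disallows nothing), while

User-agent: *
Disallow: /

blocks the entire site.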

 

Wouldn't it be nice to be able to throw that all away and start over with a clean robots-control standard? Sigh.

 

Something else I was wondering about: since the sitemap must be specified with its full URL, is this automatically a way of choosing the preferred domain and protocol? Assuming "domain.com" serves the same robots.txt file as "www.domain.com" - wouldn't the domain listed in the linked sitemap file automatically be given more value? Could you also redirect misled crawlers that happened to prefer the https version of a site over the http version? That would be really neat. But would the value of the canonicals automatically be transferred to the chosen one?
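
 

For instance (hypothetical hosts): if both http://domain.com/robots.txt and http://www.domain.com/robots.txt return the same

Sitemap: http://www.domain.com/sitemap.xml

line, the engines only ever see the www URLs listed - the question is whether that's enough of a hint for them.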

 

John


Seems like something ideal for a <link> tag, in the same way you link in CSS files and stuff.
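
 

Something along these lines, perhaps (the rel value is purely made up for illustration - the engines don't actually look for it):

<link rel="sitemap" type="application/xml" href="http://www.example.com/sitemap.xml" />

placed in the <head>, just like a stylesheet link.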

 

From the HTML4 Specs

 

Although LINK has no content, it conveys relationship information that may be rendered by user agents in a variety of ways (e.g., a tool-bar with a drop-down menu of links).

 

Authors may use the LINK element to provide a variety of information to search engines, including:
  • Links to alternate versions of a document, written in another human language.
  • Links to alternate versions of a document, designed for different media, for instance a version especially suited for printing.
  • Links to the starting page of a collection of documents.

 

 

*my bolding

Edited by Adrian


Well, it is meant for the SE bots, and robots.txt is already the place for controlling bot behaviour.

 

Whether robots.txt needs a revamp, that's another story (and I agree it needs an update).

 

Pierre


Completely agree, ekStreme. Search engines have come a long way in a very short period of time, and as such, perhaps some period of reflection is required to make sure the technology is fit for purpose; otherwise, long term, it could be a case of square peg, round hole.


I agree with Pierre & Egain -- this is a good opportunity for the robots.txt file to be updated to serve the new technologies of today.

 

However, should we stay with the kludgy syntax of Robots.txt? Should a whole new, parallel standard be adopted as Robots.txt is phased out? Maybe we should lobby for a Spiders.txt?


I gave this a try on the 11th, and none of the engines has read the sitemap. All three have read the robots.txt since.

 

So today, the 15th, I sent the HTTP request to inform them of the sitemap.xml. Google, Yahoo, and Ask all give me a 404. I tried it both URL-encoded and not.

 

I used <searchengine_URL>/ping?sitemap=sitemap_url as per sitemaps.org.


Hi Bob

Here are the ping addresses:

 

Ask: http://submissions.ask.com/ping?sitemap=http%3A//www.domain.com/sitemap.xml

Google: http://www.google.com/webmasters/sitemaps/ping?sitemap=http:%3A//www.domain.com/sitemap.xml

Yahoo: http://search.yahooapis.com/SiteExplorerService/V1/updateNotification?appid=YahooDemo&url=http://www.domain.com/sitemap.xml

 

Regardless, I would strongly suggest using the Google Webmaster Tools to submit the sitemap file to Google -- if you have any errors in the file, you will only be notified there.

 

John

Edited by softplus


Ah! I did not use those URLs. I do use Google Webmaster Tools for one site, but I wanted to see how this works. I used GSiteCrawler, so I presumed the file to be OK. :)

 

And for anyone following, I presume that the Google one should not have both the colon and the %3A - the

http:%3A//

part should presumably be just http:// (or its fully encoded form).

Guest joedolson

However, should we stay with the kludgy syntax of Robots.txt? Should a whole new, parallel standard be adopted as Robots.txt is phased out? Maybe we should lobby for a Spiders.txt?

 

Full support for regular expressions would be a FANTASTIC new feature for a major overhaul of robots.txt, in my opinion. It's very limiting as is...
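
 

For example, patterns like

Disallow: /*?sessionid=

Disallow: /*.pdf$

(the specific patterns are just illustrations) would be trivial with proper pattern support - some engines already honour * and $ as extensions, but that's not part of the official standard.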


Keeping robots.txt simple is best, as clunky as it is. Regular expressions would introduce a whole mess of problems and complexities - there are whole books written about regular expressions.

 

<searchengine_URL>/ping?sitemap=sitemap_url seems pretty simple to me. Keep the complex stuff in the sitemap file. OK. spiders.txt


I just wanted to relay some comments made by Maile Ohye of Google - for those using this setup:

 

- The Sitemap line is independent of the user-agent. However, you can trick a search engine into using a desired sitemap file by listing multiple sitemap files and disallowing all but one per user-agent, e.g.:

User-agent: googlebot

Disallow: /sitemap2.xml

 

User-agent: MSNBot

Disallow: /sitemap1.xml

 

Sitemap: http://www.example.com/sitemap1.xml

Sitemap: http://www.example.com/sitemap2.xml

(note: the MSNBot entry is just theory - MSN doesn't pick up sitemap files yet)

 

- If your server serves the same content for both canonical domain versions (www/non-www), then a robots.txt with a sitemap file for a given version will not change much; it's most likely going to be interpreted differently by each search engine. Use a 301 to clean it up, if you can.
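
 

If you're on Apache, a minimal .htaccess sketch of that 301 might look like this (mod_rewrite assumed, and example.com is just a placeholder for your own host):

RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]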

 

John

Regardless, I would strongly suggest using the Google Webmaster Tools to submit the sitemap file to Google -- if you have any errors in the file, you will only be notified there.

 

That's pretty much what Vanessa Fox said during her recent discussion with Rand Fishkin (aka Randfish), i.e. that it's better to test via Webmaster Tools before using the Sitemaps.org way of doing things.


Does anyone know if it is still preferred to upload the gzipped version of the sitemap file, i.e. sitemap.xml.gz?

 

Would all the above advice work or does any of it break?

