Cre8asiteforums Internet Marketing and Conversion Web Design


Use Your Robots.txt To Publish Your Sitemaps Xml File



#1 AtlasOrient

AtlasOrient

    Ready To Fly Member

  • Members
  • 12 posts

Posted 11 April 2007 - 11:07 AM

News flash -- as of today, you no longer need to manually send an HTTP request to each search engine to tell it where your Sitemaps protocol URL listing is. You just place the address of the XML file on a new line in the old & faithful Robots.txt, and voila! Google, Yahoo, Live, and Ask can all discover your Sitemaps file from there.
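
For anyone who wants to try it right away, a minimal robots.txt using the new line might look like this (example.com is just a placeholder for your own domain):

User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml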

Here's some proof:

http://googlewebmast...itemapsorg.html
http://www.sitemaps....l#submit_robots

Does anybody see any limitations with this technique? It also doesn't take into account the fact that you still have to ping the search engines to tell them you've updated your Sitemaps URL listing.

#2 bwelford

bwelford

    Peacekeeper Administrator

  • Site Administrators
  • 9008 posts

Posted 11 April 2007 - 11:55 AM

Welcome to the Forums, AtlasOrient. :wave:

I too found that news most interesting and exciting. I would think the pinging is now redundant, unless the spiders visit very rarely. If you check traffic logs, you will find that spiders sometimes read only the robots.txt file even when they're allowed to read more. I think it's the quickest way they can check that the website is still alive.

#3 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 11 April 2007 - 12:27 PM

I've been hoping for this since the start of Sitemaps - I love it! Now all we need is a global way to get pages out of the index quickly (and of course wildcard support as part of the robots.txt standard). SES isn't over - what more will they have in store for us?

Welcome to the forums, AtlasOrient :wave: - tell us more about yourself :blink:

John

#4 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 11 April 2007 - 01:28 PM

I think it's a great idea: robots.txt has now gained a new feature for controlling bots. Even the syntax is easy!

Pierre

#5 AtlasOrient

AtlasOrient

    Ready To Fly Member

  • Members
  • 12 posts

Posted 11 April 2007 - 01:50 PM

What do you guys think about the fact that the Robots Exclusion Standard is being used for the explicit inclusion of URLs? Admittedly, this is abusing the purpose of this standard file (see http://robotstxt.org for the details). Should there be a different file, with a different standard for this? I think it's convenient for Robots.txt to be used, but I can understand arguments against using this file for any old URL purpose.

Hi folks, thank you kindly for the warm welcome. I'm in the SEO/technology outsourcing business. I look forward to getting to know more of you folks here!

Edited by AtlasOrient, 17 April 2007 - 09:44 AM.


#6 cre8pc

cre8pc

    Dream Catcher Forums Founder

  • Admin - Top Level
  • 13527 posts

Posted 11 April 2007 - 02:10 PM

I heard about this news at the NYC SES a little while ago. I don't have any more information, but perhaps it was picked up in the SES coverage and discussed in more detail somewhere.

#7 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 11 April 2007 - 03:42 PM

I'm not that happy about it being in the robots.txt - I feel a meta-tag on the root page would have been more logical... it seems a bit like a hack, but then again, how quickly could you get webmasters to create and use yet another "standard" file? :)

Come to think of it, the robots.txt is a bit of a hack anyway: outdated and ambiguous user-agent names, "Disallow: " meaning "disallow nothing", some engines accepting wildcards but not all of them, etc. Not to mention the unknown consequences of listing URLs there - are they removed from the search results or just not re-crawled? Are the old listings phased out or just not updated? What happens when the server is down or busy while the engine is requesting the robots.txt? What happens when servers return an error page with code 200 for robots.txt? It's just not that clear.
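
To see the empty "Disallow:" quirk in action, compare these two blocks - the first allows everything, the second blocks the entire site, one character apart:

User-agent: *
Disallow:

User-agent: *
Disallow: /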

Wouldn't it be nice to be able to throw that all away and start over with a clean robots-control standard? Sigh.

Something else that I was wondering about: since the sitemap must be specified with the full URL, is this automatically a way of choosing the preferred domain and protocol? Assuming "domain.com" serves the same robots.txt file as "www.domain.com" - wouldn't the domain listed in the linked sitemap file automatically be given more value? Could you also redirect misled crawlers who happened to prefer the https version of a site (over the http version)? That would be really neat. But will the value of the canonicals automatically be transferred to the chosen one?

John

#8 Adrian

Adrian

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 5779 posts

Posted 11 April 2007 - 06:59 PM

Seems like something ideal for a <link> tag, in the same way you link in CSS files and stuff.

From the HTML4 Specs

Although LINK has no content, it conveys relationship information that may be rendered by user agents in a variety of ways (e.g., a tool-bar with a drop-down menu of links).


Authors may use the LINK element to provide a variety of information to search engines, including:

  • Links to alternate versions of a document, written in another human language.
  • Links to alternate versions of a document, designed for different media, for instance a version especially suited for printing.
  • Links to the starting page of a collection of documents.


*my bolding
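
Something along these lines, purely hypothetical of course - no search engine actually supports a sitemap relation in LINK today:

<link rel="sitemap" type="application/xml" href="http://www.example.com/sitemap.xml">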

Edited by Adrian, 11 April 2007 - 07:00 PM.


#9 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 11 April 2007 - 07:54 PM

Well, it is meant for the SE bots, and robots.txt is already the place for controlling bot behaviour.

Whether robots.txt needs a revamp, that's another story (and I agree it needs an update).

Pierre

#10 egain

egain

    Gravity Master Member

  • Members
  • 121 posts

Posted 12 April 2007 - 08:01 AM

Completely agree, eKstreme. Search engines have come a long way in a very short period of time, and as such, perhaps some period of reflection is required to make sure the technology is fit for purpose; otherwise, long term, it could be a case of square peg, round hole.

#11 AtlasOrient

AtlasOrient

    Ready To Fly Member

  • Members
  • 12 posts

Posted 12 April 2007 - 10:59 AM

I agree with Pierre & Egain -- this is a good opportunity for the Robots.txt file to be updated and serve the new technologies of today.

However, should we stay with the kludgy syntax of Robots.txt? Should a whole new, parallel standard be adopted as Robots.txt is phased out? Maybe we should lobby for a Spiders.txt?

#12 AbleReach

AbleReach

    Peacekeeper Administrator

  • Site Administrators
  • 6467 posts

Posted 13 April 2007 - 04:38 AM

The "Interview @ SES" mentioned in this thread talks about sitemaps and robots.txt.

#13 bobbb

bobbb

    Sonic Boom Member

  • Hall Of Fame
  • 2058 posts

Posted 16 April 2007 - 12:22 AM

Gave this a try on the 11th and none of them has read the sitemap. All three have read the robots.txt since.

So today, the 15th, I did the HTTP request to inform them of the sitemap.xml. Google, Yahoo, and Ask all give me a 404. I tried it URL-encoded and not.

Used: <searchengine_URL>/ping?sitemap=sitemap_url as per sitemaps.org

#14 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 16 April 2007 - 02:08 AM

Hi Bob
Here are the Ping addresses (remove space before using):

Ask: http://submissions.ask. com/ping?sitemap=http%3A//www.domain.com/sitemap.xml
Google: http://www.google. com/webmasters/sitemaps/ping?sitemap=http:%3A//www.domain.com/sitemap.xml
Yahoo: http://search.yahooapis. com/SiteExplorerService/V1/updateNotification?appid=YahooDemo&url=http://www.domain.com/sitemap.xml

Regardless, I would strongly suggest using the Google Webmaster Tools to submit the sitemap file to Google -- if you have any errors in the file, you will only be notified there.
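
If you'd rather script the ping than paste the URLs into a browser, a quick sketch along these lines should do it (Python; the sitemap URL is a placeholder for your own, and the endpoints are the ones above with the spaces removed):

from urllib.parse import quote
from urllib.request import urlopen

SITEMAP = "http://www.domain.com/sitemap.xml"  # placeholder - use your own sitemap URL

# The ping endpoints listed above (spaces removed), with the sitemap URL percent-encoded.
endpoints = [
    "http://submissions.ask.com/ping?sitemap=" + quote(SITEMAP, safe=""),
    "http://www.google.com/webmasters/sitemaps/ping?sitemap=" + quote(SITEMAP, safe=""),
    "http://search.yahooapis.com/SiteExplorerService/V1/updateNotification?appid=YahooDemo&url=" + quote(SITEMAP, safe=""),
]

for url in endpoints:
    with urlopen(url) as response:  # a ping is just a plain HTTP GET
        print(url, "->", response.status)  # 200 means the ping went through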

John

Edited by softplus, 16 April 2007 - 02:09 AM.


#15 bobbb

bobbb

    Sonic Boom Member

  • Hall Of Fame
  • 2058 posts

Posted 16 April 2007 - 10:48 AM

Ah! Did not use those URLs. I do use Google Webmaster Tools for one site but I wanted to see how this works. Used Gsitecrawler so I presumed the file to be OK. :)

And for anyone following, I presume that the Google one should not have either the colon or %3A

http:%3A//
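
So presumably the working form, with the same placeholder domain, is either fully encoded or not encoded at all:

http://www.google.com/webmasters/sitemaps/ping?sitemap=http%3A%2F%2Fwww.domain.com%2Fsitemap.xml
http://www.google.com/webmasters/sitemaps/ping?sitemap=http://www.domain.com/sitemap.xml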



#16 Guest_joedolson_*

Guest_joedolson_*
  • Guests

Posted 16 April 2007 - 10:59 AM

However, should we stay with the kludgy syntax of Robots.txt? Should a whole new, parallel standard be adopted as Robots.txt is phased out? Maybe we should lobby for a Spiders.txt?


Full support for regular expressions would be a FANTASTIC new feature for a major overhaul of robots.txt, in my opinion. It's very limiting as is...
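
For comparison, the closest thing available right now is the limited wildcard support some engines (Googlebot, for one) already recognise outside the official standard - a far cry from real regular expressions:

User-agent: Googlebot
Disallow: /*?sessionid=
Disallow: /*.pdf$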

#17 bobbb

bobbb

    Sonic Boom Member

  • Hall Of Fame
  • 2058 posts

Posted 16 April 2007 - 12:09 PM

Keeping robots.txt simple is best, as clunky as it is. Regular expressions would introduce a whole mess of problems and complexities. There are whole books written about regular expressions.

<searchengine_URL>/ping?sitemap=sitemap_url seems pretty simple to me. Keep the complex stuff in the sitemap file. OK. spiders.txt

#18 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 18 April 2007 - 03:48 PM

I just wanted to relay some comments made by Maile Ohye of Google - for those using this setup:

- The Sitemap: line is independent of the user-agent. However, you can trick a search engine into using a desired sitemap file by listing multiple sitemap files and disallowing all but one (per user-agent), e.g.:

User-agent: googlebot
Disallow: /sitemap2.xml

User-agent: MSNBot
Disallow: /sitemap1.xml

Sitemap: http://www.example.com/sitemap1.xml
Sitemap: http://www.example.com/sitemap2.xml

(note: the MSNBot entry is just theory - MSN doesn't pick up sitemap files yet)

- If your server serves the same content for canonical domain versions (www/non-www) then a robots.txt with a sitemap file for a given version will not change much; it's most likely going to be interpreted differently per search engine. Use a 301 to clean it up, if you can.

John

#19 egain

egain

    Gravity Master Member

  • Members
  • 121 posts

Posted 19 April 2007 - 06:54 AM

Regardless, I would strongly suggest using the Google Webmaster Tools to submit the sitemap file to Google -- if you have any errors in the file, you will only be notified there.


That's pretty much what Vanessa Fox said during her recent discussion with Rand Fishkin (aka Randfish), i.e. that it is better to test via Webmaster Tools before utilising the Sitemaps.org way of doing things.

#20 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 21 April 2007 - 02:36 AM

This is apparently the correct URL for pinging Yahoo:
http://search.yahooapis.com/SiteExplorerService/V1/ping?url=http%3A%2F%2Fwww.domain.com%2Fsitemap.xml

(see http://developer.yah...er/V1/ping.html )

Thanks, Maile!

John

#21 bwelford

bwelford

    Peacekeeper Administrator

  • Site Administrators
  • 9008 posts

Posted 21 April 2007 - 05:58 AM

Does anyone know if it is still preferred to upload the gzipped version of the sitemap file, i.e. sitemap.xml.gz?

Would all the above advice work or does any of it break?


