Use Your Robots.txt To Publish Your Sitemaps XML File, Automatically publish your URLs to Google, Yahoo, Live, Ask

Member

Group: Members
Joined: 3-April 07
Posts: 12
post Apr 11 2007, 11:07 AM
News flash -- as of today, you no longer need to manually send an HTTP request to each search engine to tell them where your Sitemaps protocol URL listing is. You just place the address of the XML file on a new line in the old and faithful robots.txt, and voila! Google, Yahoo, Live, and Ask can all discover your Sitemaps file from there.

Here's some proof:

http://googlewebmastercentral.blogspot.com...itemapsorg.html
http://www.sitemaps.org/protocol.html#submit_robots

Does anybody see any limitations with this technique? It also doesn't take into account the fact that you still have to ping the search engines to tell them you've updated your Sitemaps URL listing.
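
For anyone curious what discovery looks like from the crawler's side, here is a minimal Python sketch of pulling Sitemap lines out of a robots.txt body. The function name `find_sitemaps` and the sample file are made up for illustration, and real crawlers may well parse more leniently or more strictly than this.

```python
def find_sitemaps(robots_txt):
    """Return the URLs from every 'Sitemap:' line in a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Split on the first colon only, so the "http://" in the URL survives.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /private/

Sitemap: http://www.example.com/sitemap.xml
"""

print(find_sitemaps(SAMPLE))
```

The field name is matched case-insensitively and without regard to which User-agent block it sits near, which is how sitemaps.org describes the line.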

Moderator

Group: Moderators
Joined: 6-March 03
Posts: 7,962
From: Langley, British Columbia, Canada
post Apr 11 2007, 11:55 AM
Welcome to the Forums, AtlasOrient.

I too found that most interesting and exciting news. I would think the pinging is now redundant, unless the spiders visit very rarely. If you check traffic logs, you will find that spiders will sometimes read only the robots.txt file even if they're allowed to read more. I think it's the most rapid way they can check that the website is still alive.


Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
post Apr 11 2007, 12:27 PM
I've been hoping for this since the start of Sitemaps - I love it! Now all we need is a global way to get pages out of the index quickly (and of course wildcard support as part of the robots.txt standard). SES isn't over - what more will they have in store for us?

Welcome to the forums, AtlasOrient - tell us more about yourself.

John

Membership Admin & Moderator

Group: Membership Admin & Moderator
Joined: 30-September 05
Posts: 3,267
From: Some round-ish rock floating in a vacuum.
post Apr 11 2007, 01:28 PM
I think it's a great idea: robots.txt has now gained a new feature to control bots. Even the syntax is easy!

Pierre

Member

Group: Members
Joined: 3-April 07
Posts: 12
post Apr 11 2007, 01:50 PM
What do you guys think about the fact that the Robots Exclusion Standard is being used for the explicit inclusion of URLs? Admittedly, this is abusing the purpose of this standard file (see http://robotstxt.org for the details). Should there be a different file, with a different standard for this? I think it's convenient for Robots.txt to be used, but I can understand arguments against using this file for any old URL purpose.

Hi folks, thank you kindly for the warm welcome. I'm in the SEO/technology outsourcing business. I look forward to getting to know more of you folks here!

This post has been edited by AtlasOrient: Apr 17 2007, 09:44 AM

Founder & Administrator

Group: Admin - Top Level
Joined: 29-August 02
Posts: 11,644
From: Bucks County, PA
post Apr 11 2007, 02:10 PM
I heard about this news at the NYC SES a little while ago. I have no information, but perhaps this was picked up in SES coverage and discussed in more detail somewhere.

Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
post Apr 11 2007, 03:42 PM
I'm not that happy about it being in the robots.txt - I feel a meta-tag on the root page would have been more logical... it seems a bit like a hack, but then again, how quickly could you get webmasters to create and use yet another "standard" file?

Come to think of it, the robots.txt is a bit of a hack anyway: outdated and ambiguous user-agent names, "Disallow: " meaning "disallow nothing", some engines accepting wildcards but not all of them, etc. Not to mention the unknown consequences of listing URLs there: are they removed from the search results or just not re-crawled? Are the old listings phased out or just not updated? What happens when the server is down or busy while the engine is requesting the robots.txt? What happens when servers return an error page with code 200 for robots.txt? It's just not that clear.
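
The empty "Disallow:" quirk can be seen with Python's standard-library robots.txt parser. This is only a sketch of how that one parser reads the file, not a statement about how any particular search engine behaves; the helper name `can_crawl`, the agent name and the URL are placeholders.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, agent, url):
    """Ask Python's stdlib parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


URL = "http://www.example.com/page.html"  # placeholder URL

# An empty Disallow value is read as "nothing is disallowed".
print(can_crawl("User-agent: *\nDisallow:", "anybot", URL))
# A single slash is read as "everything is disallowed".
print(can_crawl("User-agent: *\nDisallow: /", "anybot", URL))
```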

Wouldn't it be nice to be able to throw that all away and start over with a clean robots-control standard? Sigh.

Something else that I was wondering about: since the sitemap must be specified with the full URL, is this automatically a way of choosing the preferred domain and protocol? Assuming "domain.com" serves the same robots.txt file as "www.domain.com" - wouldn't the domain listed in the linked sitemap file automatically be given more value? Could you also redirect misled crawlers who happened to prefer the https version of a site (over the http version)? That would be really neat. But will the value of the canonicals automatically be transferred to the chosen one?

John

Moderator

Group: Moderators
Joined: 29-August 02
Posts: 5,751
From: Bristol, UK
post Apr 11 2007, 06:59 PM
Seems like something ideal for a <link> tag, in the same way you link in CSS files and stuff.

From the HTML4 Specs

QUOTE
Although LINK has no content, it conveys relationship information that may be rendered by user agents in a variety of ways (e.g., a tool-bar with a drop-down menu of links).


QUOTE
Authors may use the LINK element to provide a variety of information to search engines, including:
  • Links to alternate versions of a document, written in another human language.
  • Links to alternate versions of a document, designed for different media, for instance a version especially suited for printing.
  • Links to the starting page of a collection of documents.


*my bolding

This post has been edited by Adrian: Apr 11 2007, 07:00 PM

Membership Admin & Moderator

Group: Membership Admin & Moderator
Joined: 30-September 05
Posts: 3,267
From: Some round-ish rock floating in a vacuum.
post Apr 11 2007, 07:54 PM
Well, it is meant for the SE bots, and robots.txt is already the place for controlling bot behaviour.

Whether robots.txt needs a revamp, that's another story (and I agree it needs an update).

Pierre

Centenarian Poster

Group: Members
Joined: 5-December 05
Posts: 121
From: UK
post Apr 12 2007, 08:01 AM
Completely agree, ekStreme. Search engines have come a long way in a very short period of time, and as such, perhaps some period of reflection is required to make sure the technology is fit for purpose; otherwise, long term, it could be a case of square peg, round hole.

Member

Group: Members
Joined: 3-April 07
Posts: 12
post Apr 12 2007, 10:59 AM
I agree with Pierre & Egain -- this is a good opportunity for the Robots.txt file to be updated and serve the new technologies of today.

However, should we stay with the kludgy syntax of Robots.txt? Should a whole new, parallel standard be adopted as Robots.txt is phased out? Maybe we should lobby for a Spiders.txt?

Moderator/Blog Editor

Group: Site Admin
Joined: 18-January 05
Posts: 5,375
From: Olympia WA, USA
post Apr 13 2007, 04:38 AM
Interview @ SES mentioned in this thread talks about site maps and robots.txt

Star Member

Group: 1000 Post Club
Joined: 10-March 05
Posts: 1,065
From: Montreal Canada
post Apr 16 2007, 12:22 AM
Gave this a try on the 11th and no one has read the sitemap. All three have read the robots.txt since.

So today, the 15th, I sent the HTTP request to inform them of the sitemap.xml. Google, Yahoo, and Ask all give me a 404. I tried it both with and without URL encoding.

Used: <searchengine_URL>/ping?sitemap=sitemap_url as per sitemaps.org

Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
post Apr 16 2007, 02:08 AM
Hi Bob
Here are the Ping addresses (remove space before using):

Ask: http://submissions.ask. com/ping?sitemap=http%3A//www.domain.com/sitemap.xml
Google: http://www.google. com/webmasters/sitemaps/ping?sitemap=http:%3A//www.domain.com/sitemap.xml
Yahoo: http://search.yahooapis. com/SiteExplorerService/V1/updateNotification?appid=YahooDemo&url=http://www.domain.com/sitemap.xml

Regardless, I would strongly suggest using the Google Webmaster Tools to submit the sitemap file to Google -- if you have any errors in the file, you will only be notified there.

John

This post has been edited by softplus: Apr 16 2007, 02:09 AM

Star Member

Group: 1000 Post Club
Joined: 10-March 05
Posts: 1,065
From: Montreal Canada
post Apr 16 2007, 10:48 AM
Ah! Did not use those URLs. I do use Google Webmaster Tools for one site but I wanted to see how this works. Used GSiteCrawler so I presumed the file to be OK.

And for anyone following, I presume that the Google one should not have either the colon or %3A
QUOTE
http:%3A//
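
One way to sidestep the half-encoded "http:%3A//" question is to percent-encode the whole sitemap address before appending it. A minimal Python sketch follows; the helper name `build_ping_url` is made up, and the endpoint is simply the Google one quoted earlier in this thread, which may not accept pings any more. It only builds the string and sends nothing.

```python
from urllib.parse import quote


def build_ping_url(endpoint, sitemap_url, param="sitemap"):
    """Append a fully percent-encoded sitemap URL to a ping endpoint."""
    # safe="" makes quote() encode ":" and "/" as well, so the result
    # is never a half-encoded mix such as "http:%3A//".
    return f"{endpoint}?{param}={quote(sitemap_url, safe='')}"


# Endpoint as quoted in this thread (2007); treat it as an assumption.
GOOGLE_PING = "http://www.google.com/webmasters/sitemaps/ping"

print(build_ping_url(GOOGLE_PING, "http://www.domain.com/sitemap.xml"))
```

Whether a given engine also tolerates an unencoded or partly encoded address is not something this thread settles; full encoding is just the conservative choice.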

Technical Administrator

Group: Technical Administrators
Joined: 8-March 06
Posts: 2,650
From: Minneapolis/Saint Paul, MN
post Apr 16 2007, 10:59 AM
QUOTE

However, should we stay with the kludgy syntax of Robots.txt? Should a whole new, parallel standard be adopted as Robots.txt is phased out? Maybe we should lobby for a Spiders.txt?


Full support for regular expressions would be a FANTASTIC new feature for a major overhaul of robots.txt, in my opinion. It's very limiting as is...

Star Member

Group: 1000 Post Club
Joined: 10-March 05
Posts: 1,065
From: Montreal Canada
post Apr 16 2007, 12:09 PM
Keeping robots.txt simple is best, as klunky as it is. Regular expressions would introduce a whole mess of problems and complexities. There are whole books written about regular expressions.

<searchengine_URL>/ping?sitemap=sitemap_url seems pretty simple to me. Keep the complex stuff in the sitemap file. OK. spiders.txt

Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
post Apr 18 2007, 03:48 PM
I just wanted to relay some comments made by Maile Ohye of Google - for those using this setup:

- The Sitemap line is independent of the user-agent. However, you can trick the search engine into using a desired sitemap file by listing multiple sitemap files and disallowing all but one (per user-agent), e.g.:
QUOTE
User-agent: googlebot
Disallow: /sitemap2.xml

User-agent: MSNBot
Disallow: /sitemap1.xml

Sitemap: http://www.example.com/sitemap1.xml
Sitemap: http://www.example.com/sitemap2.xml

(note: the MSNBot part is just theory - MSN doesn't pick up sitemap files yet)

- If your server serves the same content for canonical domain versions (www/non-www) then a robots.txt with a sitemap file for a given version will not change much; it's most likely going to be interpreted differently per search engine. Use a 301 to clean it up, if you can.
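
The per-user-agent trick from the first point can be tried out offline with Python's standard-library parser. This is a sketch under assumptions: it shows what `urllib.robotparser` computes for the quoted file, not what Googlebot or MSNBot actually do, and the helper name `allowed_sitemaps` is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the quote above.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /sitemap2.xml

User-agent: MSNBot
Disallow: /sitemap1.xml

Sitemap: http://www.example.com/sitemap1.xml
Sitemap: http://www.example.com/sitemap2.xml
"""


def allowed_sitemaps(robots_txt, agent):
    """List the declared sitemaps that `agent` is not disallowed from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() needs Python 3.8+; it returns None when no Sitemap line exists.
    sitemaps = parser.site_maps() or []
    return [url for url in sitemaps if parser.can_fetch(agent, url)]


for agent in ("googlebot", "MSNBot", "otherbot"):
    print(agent, allowed_sitemaps(ROBOTS_TXT, agent))
```

An agent with no matching block ("otherbot" here) is left with both sitemap files, so the trick only narrows things down for the agents named explicitly.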

John

Centenarian Poster

Group: Members
Joined: 5-December 05
Posts: 121
From: UK
post Apr 19 2007, 06:54 AM
QUOTE
Regardless, I would strongly suggest using the Google Webmaster Tools to submit the sitemap file to Google -- if you have any errors in the file, you will only be notified there.


That's pretty much what Vanessa Fox said during her recent discussion with Rand Fishkin (aka Randfish), i.e. that it is better to test via Webmaster Tools before using the Sitemaps.org way of doing things.

Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
post Apr 21 2007, 02:36 AM
This is apparently the correct URL for pinging Yahoo:
http://search.yahooapis.com/ SiteExplorerService/V1/ping?url=http%3A%2F%2Fwww.domain.com%2Fsitemap.xml

(see http://developer.yahoo.com/search/siteexplorer/V1/ping.html )

Thanks, Maile!

John