> Google Crawling Absolute Paths?

Untested

Group: Members
Joined: 3-April 05
Posts: 8
post Jun 22 2008, 05:29 PM
I was just poking around in Google Webmaster Tools "Web Crawl" and noticed a bunch of recent 404 Not Found errors showing up. That's not too unusual for me because I'm a bit lax with link fixing. What struck me as strange is that the paths included the Apache document root (the filesystem path) instead of the normal URL paths. I have never seen this in Google Webmaster Tools for any site in all my years. I wonder if any of you seasoned experts might have an idea what is happening... Has anyone seen this before?

Example:
I see this in G Webmaster Tools:
CODE
http://www.mysitenamehere.com/home/mysiteusername/public_html/file.php


Instead of:
CODE
http://www.mysitenamehere.com/file.php
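
To make the difference between the two URLs concrete, here is a minimal Python sketch that maps a reported URL containing the document root back to the URL that was presumably intended. The function name `strip_docroot` is hypothetical, and the document root is taken from the example above; it is an illustration only, not something Webmaster Tools provides.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_docroot(url, docroot):
    """Return url with a leading filesystem docroot removed from its path.

    docroot is an assumption supplied by the caller, e.g.
    "/home/mysiteusername/public_html". URLs that do not start with
    the docroot are returned unchanged.
    """
    parts = urlsplit(url)
    root = "/" + docroot.strip("/")
    if parts.path == root or parts.path.startswith(root + "/"):
        # Drop the docroot prefix; an empty remainder means the site root.
        new_path = parts.path[len(root):] or "/"
        return urlunsplit(
            (parts.scheme, parts.netloc, new_path, parts.query, parts.fragment)
        )
    return url


if __name__ == "__main__":
    reported = "http://www.mysitenamehere.com/home/mysiteusername/public_html/file.php"
    print(strip_docroot(reported, "/home/mysiteusername/public_html"))
```

Run over the list of crawl errors, this would show which reported 404s correspond to pages that actually exist under their normal URL.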


This post has been edited by Emory: Jun 24 2008, 12:28 AM

Moderator Alumni

Group: Hall Of Fame
Joined: 1-September 02
Posts: 9,213
From: UK
post Jun 25 2008, 08:26 AM
Download yourself the Xenu Link Sleuth and use it to crawl your site. There's a strong probability that you have those links erroneously on your live site somewhere. You see, Google mostly crawls links. If it finds any URL, it's almost certain that it found it by following a link to it.

There are other ways that Google can possibly get a URL, and one such is if you have the Google toolbar, or any third-party equivalent tool to submit the URL of any page you view in order to view its PageRank.

However, the 99% probability is that it followed a link. There's a small possibility that it could have found the link on some kind of open set of stats somewhere, but it really is most likely one accidental link on the site, especially if you ever code using programs like Dreamweaver or anything else that could have dropped in a link in the wrong format through an accidental drag-and-drop at some point.

Xenu Link Sleuth will ensure you can cross off that possibility and then find how Google got the link.
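
For a quick check of a single page without a full crawl, the same idea can be sketched in Python: collect every `href` and `src` value and flag the ones that look like filesystem paths. This is a rough illustration, not a substitute for a link checker. The names `LinkCollector`, `find_suspect_links` and the `SUSPECT_MARKERS` list are hypothetical, and the markers are guesses at common document-root fragments that may also match legitimate URLs on some sites.

```python
from html.parser import HTMLParser

# Assumed fragments that often appear in server filesystem paths.
# Adjust for your own host; "/home/" in particular can be a real URL.
SUSPECT_MARKERS = ("/public_html/", "/home/", "/var/www/", "/htdocs/")


class LinkCollector(HTMLParser):
    """Collect the values of href and src attributes from a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def find_suspect_links(html):
    """Return links whose text contains one of the suspect markers."""
    parser = LinkCollector()
    parser.feed(html)
    return [
        link for link in parser.links
        if any(marker in link for marker in SUSPECT_MARKERS)
    ]


if __name__ == "__main__":
    page = (
        '<a href="/home/mysiteusername/public_html/file.php">bad</a>'
        '<a href="/file.php">good</a>'
    )
    for link in find_suspect_links(page):
        print("suspect link:", link)
```

Feeding it the saved HTML of each template or leftover script file is usually enough to spot an accidental document-root link.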

Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
post Jun 25 2008, 08:40 AM
In general it's like Ammon mentioned - a link somewhere on your site. What I've also seen is that a site has its statistics publicly visible (some hosters do that to make it easy for you to find :-)) -- if that's the case, it's possible that those statistics have a link to a URL like that somewhere in them as well.

At any rate, we can't view your server's file system (at least not if it's configured correctly), so even if we happen to stumble upon a URL like that, your error page should keep us from worrying too much about it (make sure it returns 404).

John

Star Member

Group: 1000 Post Club
Joined: 17-June 04
Posts: 1,760
From: Essex, UK
post Jun 25 2008, 08:51 AM
Off topic (a bit),

John,

Does Google not like custom 404 pages? I have configured Apache to direct all 404 errors to a custom page, but Google won't allow me to verify the site with this in place.

And I occasionally get the same problem as Emory, bots looking for pages that just don't exist. Even after a link check I can't find the 'missing page'. Not particularly worried about it but it would be nice to know why.

Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
post Jun 25 2008, 09:15 AM
O/T :-)

Sure we like custom 404 pages, but it's best if the URL returns 404 directly instead of being (301/302) redirected there. In Apache it's easy to get this wrong: if you specify the full URL (starting with http://) for the 404 handler (the ErrorDocument directive) in your .htaccess file, it will actually redirect to that page. If that page is a static HTML page, it will likely return a code 200. On the other hand, if you specify a local path on the same site, Apache will generally serve the 404 page in place of the missing URL, automatically setting the result code to 404.

If you have to redirect, I'm fairly certain that a 301 redirect would be the best kind of redirect, since it tells the crawlers that the old URL has changed and is no longer valid -- but even with a redirect, the final page should return result code 404.

Hope it helps!
John
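
One way to check which of these cases a site falls into is to request a URL that is known not to exist and look at the status codes without following redirects automatically. The Python sketch below does that with the standard library only. The function names `fetch_status_chain` and `classify`, and the result labels, are made up for illustration, and some servers answer HEAD requests differently from GET, so treat the result as a hint rather than a verdict.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_status_chain(url, max_hops=5):
    """Request url hop by hop and return a list of (status, url) pairs."""
    chain = []
    for _ in range(max_hops):
        parts = urlsplit(url)
        if parts.scheme == "https":
            conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
        else:
            conn = http.client.HTTPConnection(parts.netloc, timeout=10)
        try:
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            conn.request("HEAD", path)
            resp = conn.getresponse()
            status = resp.status
            location = resp.getheader("Location")
        finally:
            conn.close()
        chain.append((status, url))
        if status in REDIRECT_CODES and location:
            # Follow the redirect ourselves so every hop is recorded.
            url = urljoin(url, location)
        else:
            break
    return chain


def classify(statuses):
    """Describe the status codes seen when requesting a missing page."""
    if not statuses:
        return "no response"
    final = statuses[-1]
    redirected = len(statuses) > 1
    if final == 404:
        return "redirected to 404" if redirected else "direct 404"
    if final == 200:
        return "returns 200"  # the error page looks like a real page
    return "other"


if __name__ == "__main__":
    # Hypothetical missing URL on the example site from this thread.
    hops = fetch_status_chain("http://www.mysitenamehere.com/no-such-page-12345")
    print(hops, classify([status for status, _ in hops]))
```

A result of "direct 404" corresponds to the local-path setup described above, while "returns 200" after a redirect is the full-URL mistake.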

Star Member

Group: 1000 Post Club
Joined: 17-June 04
Posts: 1,760
From: Essex, UK
post Jun 25 2008, 09:32 AM
Ta very much. So simple really - I shall get onto it tonight.

Untested

Group: Members
Joined: 3-April 05
Posts: 8
post Jun 28 2008, 09:09 AM
Black_Knight/John,

After reading your comments, I can be fairly certain that the links were coming from some files that were left over from an old script. I removed the files and 301'd. I wasn't linking to the files but there were some external links pointing at the directory. I need to be more careful about what I leave lying around.

I'm a big fan of Xenu, btw, and yes, I'm due for a round of sleuthing :)

Thanks for your help!