Cre8asiteforums Internet Marketing
and Conversion Web Design


Google Crawling Absolute Paths?


6 replies to this topic

#1 Emory

Emory

    Unlurked Energy

  • Members
  • 8 posts

Posted 22 June 2008 - 05:29 PM

I was just poking around in Google Webmaster Tools "Web Crawl" and noticed a bunch of recent 404 Not Found errors showing up. That's not too unusual for me because I'm a bit lax with link fixing. What I thought was quite strange is that the paths included the Apache document root (the server's filesystem path) instead of the normal URL paths. I have never seen this in Google Webmaster Tools for any site in all my years. I wonder if any of you seasoned experts might have an idea what is happening... Has anyone seen this before?

Example:
I see this in G Webmaster Tools:
http://www.mysitenamehere.com/home/mysiteusername/public_html/file.php

Instead of:
http://www.mysitenamehere.com/file.php

Edited by Emory, 24 June 2008 - 12:28 AM.


#2 Black_Knight

Black_Knight

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9339 posts

Posted 25 June 2008 - 08:26 AM

Download yourself the Xenu Link Sleuth and use it to crawl your site. There's a strong probability that you have those links erroneously on your live site somewhere. You see, Google crawls links mostly. If it finds any URL, it's almost certain that it found it by following a link to it.

There are other ways that Google can possibly get a URL. One such is the Google Toolbar, or any third-party equivalent tool, which submits the URL of any page you view in order to show its PageRank.

However, the 99% probability is that it followed a link. There's a small possibility that it could have found the link on some kind of open set of stats somewhere, but it really is most likely one accidental link on the site, especially if you ever code using programs like Dreamweaver or anything that could have dropped in a link in the wrong format through an accidental drag-and-drop sometime.

Xenu Link Sleuth will let you cross off that possibility, or else show you where Google found the link.
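For anyone who wants a quick spot check before running a full crawler, here is a minimal Python sketch of the idea: parse a page's HTML and flag any link whose path contains a piece of the server's filesystem path. The names `LinkCollector` and `find_leaked_paths`, and the `/public_html/` marker, are assumptions made for illustration (the marker is taken from Emory's example URL). It checks one page at a time and is not a substitute for a real link checker.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect href and src attribute values from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def find_leaked_paths(html, doc_root_marker="/public_html/"):
    """Return links whose URL path contains the assumed document-root marker."""
    parser = LinkCollector()
    parser.feed(html)
    return [
        link for link in parser.links
        if doc_root_marker in urlparse(link).path
    ]


# Hypothetical page fragment modelled on the URLs in the first post.
sample = """
<a href="/file.php">normal link</a>
<a href="/home/mysiteusername/public_html/file.php">leaked path</a>
"""

for link in find_leaked_paths(sample):
    print("suspicious link:", link)
```

Running it over saved copies of your pages (or over publicly visible stats pages) should narrow down where a link like this is coming from.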

#3 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 25 June 2008 - 08:40 AM

In general it's like Ammon mentioned - a link somewhere on your site. What I've also seen is that a site has its statistics publicly visible (some hosters do that to make it easy for you to find :-)) -- if that's the case, it's possible that those statistics have a link to a URL like that somewhere in them as well.

At any rate, we can't view your server's file system (at least not if it's configured correctly), so even if we happen to stumble upon a URL like that, your error page should keep us from worrying too much about it (make sure it returns 404).

John

#4 fisicx

fisicx

    Sonic Boom Member

  • Hall Of Fame
  • 1856 posts

Posted 25 June 2008 - 08:51 AM

Off topic (a bit),

John,

Does Google not like custom 404 pages? I have configured Apache to direct all 404 errors to a custom page, but Google won't allow me to verify the site with this in place.

And I occasionally get the same problem as Emory, bots looking for pages that just don't exist. Even after a link check I can't find the 'missing page'. Not particularly worried about it but it would be nice to know why.

#5 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 25 June 2008 - 09:15 AM

O/T :-)

Sure we like custom 404 pages, but it's best if the URL returns 404 directly instead of being (301/302) redirected there. In Apache it's easy to get this wrong: if you specify the full URL for the 404 handler in your .htaccess file, it will actually redirect to that page. If that page is a static HTML page, it will likely return a code 200. On the other hand, if you specify a local path (one beginning with a slash, without the host name), Apache will generally serve the 404 page in place of the missing URL, automatically setting the result code to 404.

If you have to redirect, I'm fairly certain that a 301 redirect would be the best kind of redirect, since it tells the crawlers that the old URL has changed and is no longer valid -- but even with a redirect, the final page should return result code 404.

Hope it helps!
John
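To make the distinction above concrete: in Apache terms it is roughly the difference between `ErrorDocument 404 /404.html` (local path, served with a 404) and `ErrorDocument 404 http://www.example.com/404.html` (full URL, sent as a redirect); the file name and host here are placeholders. Below is a minimal Python sketch for checking what a missing URL actually returns. The function names `fetch_status_chain` and `classify_missing_page` and the label strings are assumptions made for this example, not anything Google or Apache provides.

```python
import http.client
from urllib.parse import urljoin, urlparse

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_status_chain(url, max_hops=5):
    """Request url hop by hop (http.client never follows redirects itself)
    and return the status code seen at each hop."""
    statuses = []
    for _ in range(max_hops):
        parts = urlparse(url)
        if parts.scheme == "https":
            conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
        else:
            conn = http.client.HTTPConnection(parts.netloc, timeout=10)
        try:
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            conn.request("GET", path)
            resp = conn.getresponse()
            status = resp.status
            location = resp.getheader("Location")
        finally:
            conn.close()
        statuses.append(status)
        if status in REDIRECT_CODES and location:
            url = urljoin(url, location)  # Location may be relative
        else:
            break
    return statuses


def classify_missing_page(statuses):
    """Label the status chain observed for a URL that should not exist."""
    if not statuses:
        return "no response"
    first, final = statuses[0], statuses[-1]
    if first == 404:
        return "direct 404"
    if first in REDIRECT_CODES:
        if final == 404:
            return "redirect ending in 404"
        if final == 200:
            return "redirect ending in 200"
        return "redirect ending in %d" % final
    if first == 200:
        return "200 for a missing page"
    return "other (%d)" % first


# Classifying hand-written chains needs no network access.
print(classify_missing_page([404]))
print(classify_missing_page([302, 200]))
print(classify_missing_page([301, 404]))
```

To test a live site, pass a URL you know does not exist, for example `classify_missing_page(fetch_status_chain("http://www.example.com/no-such-page"))`. A result of "direct 404" is the preferred case described above; "redirect ending in 200" is the misconfiguration to look for.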

#6 fisicx

fisicx

    Sonic Boom Member

  • Hall Of Fame
  • 1856 posts

Posted 25 June 2008 - 09:32 AM

Ta very much. So simple really - I shall get onto it tonight.

#7 Emory

Emory

    Unlurked Energy

  • Members
  • 8 posts

Posted 28 June 2008 - 09:09 AM

Black_Knight/John,

After reading your comments, I can be fairly certain that the links were coming from some files that were left over from an old script. I removed the files and 301'd. I wasn't linking to the files but there were some external links pointing at the directory. I need to be more careful about what I leave lying around.

I'm a big fan of Xenu, btw, and yes, I'm due for a round of sleuthing :)

Thanks for your help!


