
Cre8asiteforums

Web Site Design, Usability, SEO & Marketing Discussion and Support

eKstreme

Googlebot Crawling CSS Files!

Recommended Posts

I just found this single hit in my eKstreme.com logs:

 

66.249.72.52 - - [24/Oct/2006:17:17:35 -0500] "GET /global/x.css HTTP/1.1" 200 8382 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

 

 

The requested file /global/x.css is my current CSS file for the site. The requesting IP address really is part of Google. It's the only such request since I moved to the new host in March.
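For anyone who wants to dissect lines like the one above, the fields of a combined-format log entry can be pulled apart mechanically. A quick sketch in Python (the regex is only meant to cover the combined format shown here, nothing more exotic):

```python
import re

# NCSA "combined" log format: ip, identd, user, [timestamp], "request",
# status, bytes, "referer", "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

line = ('66.249.72.52 - - [24/Oct/2006:17:17:35 -0500] '
        '"GET /global/x.css HTTP/1.1" 200 8382 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

m = LOG_RE.match(line)
assert m is not None
print(m.group("ip"), m.group("path"), m.group("agent"))
```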

 

This is news to me. Is this the end of hidden text? Can others check their logs please?

 

More on my blog.

 

Pierre


Is there an easy way to identify this in the raw logs? Just searching for .css through thousands of requests could be a nasty job :)
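One rough way to do it without eyeballing everything - a small Python sketch (the pattern is an assumption about NCSA-style access log lines; a `grep -i` over the log would do much the same job):

```python
import re

# A GET for a .css file on a line whose user-agent string mentions Googlebot.
# Assumes NCSA-style log lines; adjust the pattern for your log format.
BOT_CSS = re.compile(r'"GET (\S+\.css)[ "].*Googlebot', re.IGNORECASE)

def css_hits(log_lines):
    """Yield (path, full_line) for every Googlebot request of a .css file."""
    for line in log_lines:
        m = BOT_CSS.search(line)
        if m:
            yield m.group(1), line

sample = [
    '66.249.72.52 - - [24/Oct/2006:17:17:35 -0500] "GET /global/x.css HTTP/1.1" '
    '200 8382 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.1 - - [24/Oct/2006:17:18:00 -0500] "GET /index.html HTTP/1.1" '
    '200 1234 "-" "Mozilla/5.0"',
]
for path, _ in css_hits(sample):
    print(path)
```

Feeding it the whole access log (one line at a time) keeps memory use flat even for big files.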


Ah, but even reading the CSS, how do they work out what is hidden text?

Anything that has display:none or visibility:hidden on it?

 

Well, there are valid uses of those styles, and there are other ways to hide content: using position:absolute to put text behind other elements, or moving it off to the left or right....

 

And aren't people still getting away with the same-coloured text trick using font tags?


Well, at least detecting hidden text will help. After they find it, they can estimate how gross the violation is.

 

If that's just a "Skip to content" link that someone made invisible, then it will be alright, I guess. If there are a couple of paragraphs of keywords somewhere off the page, then it is another issue.

If there are a couple of paragraphs of keywords somewhere off the page

 

But how will they actually work that out? They can't just read the CSS file, see something like p.hidden{display:none;}, and think "ooh, some hidden text".


- First, they'll list all the known patterns for hiding text (display: none, visibility: hidden, unusually large negative or positive margins, etc.).

- Then they detect the selectors, classes and IDs associated with such styles.

- Then they find those selectors, classes and IDs on the page.

- Then they look at the size of the text, how spammy it looks, and so on.

- Then they take action.
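The steps above could be sketched very roughly like this (Python; a deliberately naive illustration - the selector parsing only handles simple class selectors and braces, nothing like a real CSS parser):

```python
import re

# Step 1: regex patterns that can hide text (a simplified subset of the real cases)
HIDING = re.compile(
    r'display\s*:\s*none|visibility\s*:\s*hidden|text-indent\s*:\s*-\d{3,}',
    re.IGNORECASE,
)

def hidden_selectors(css: str) -> set:
    """Step 2: selectors whose rule bodies contain a hiding pattern."""
    found = set()
    for selectors, body in re.findall(r'([^{}]+)\{([^}]*)\}', css):
        if HIDING.search(body):
            found.update(s.strip() for s in selectors.split(','))
    return found

def flagged_classes(html: str, selectors: set) -> set:
    """Step 3: class names used on the page that match a hidden class selector."""
    hidden = {s.split('.')[-1] for s in selectors if '.' in s}
    used = set()
    for attr in re.findall(r'class="([^"]+)"', html):
        used.update(attr.split())
    return hidden & used

css = 'p.hidden{display:none;} h1{color:red}'
html = '<h1>Title</h1><p class="hidden">stuffed keywords</p>'
print(flagged_classes(html, hidden_selectors(css)))
```

Steps 4 and 5 (judging how spammy the flagged text is, and acting on it) are exactly the parts that are hard to automate, as the rest of the thread argues.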

 

And that's assuming we are talking about external styles, not internal or inline styles, in which case some of the steps drop out.

 

Pretty simple to me, really.

Edited by A.N.Onym


The spidering may have been triggered by a "manual" reviewer requesting it.... </ponder ponder>


Could it be Matt, wearing a Googlebot costume, spidering his favorite websites? I believe he once mentioned that he does crawl the web as the Googlebot.

- Then they look at the size of text, how spammy it looks and such.

 

That's where it's very open to interpretation though....


Yup. Manual review only.


I've seen people from Google visit the site before, but never as Googlebot, and never requesting a CSS file without a referer. They tend to use Firefox, with the HTTP referer set to the page they're viewing. They usually come from a Google search (shock, horror).

 

Unless of course, the manual review fetches the pages as Googlebot...

 

This raises the question: why would they want to manually review eKstreme.com? I don't think it's that "special"!

 

Pierre


Dude, it has TOOLS :D


My comment was that a manual review would be done through a normal browser which leaves a different trail than using Googlebot.

 

Besides, the requests from the IP address around the time are random pages from all over the site. It doesn't look systematic to me.

 

Pierre


Ummm, I'm sorry I wasn't more clear.

 

I don't think that this log hit was produced by a manual review. Heh, I like to look at my raw logs from time to time and I know the difference between a bot hit and a human one.

 

My suggestion/speculation is that the SEs will not spider .css files until they are triggered by some human request to do so.

 

I'm also quite sure that if your site is being reviewed by humans, you won't necessarily notice that in your logs. If I were a SE engineer, I'd make a local copy and dissect that first and look at the dynamic server-side stuff later. If only not to alert the webmaster being scrutinised.......


oooh, time to check the logs. What about javascript files?

 

It would be just a matter of time until they actually do this. I imagine it works much like their other systems in that it tries to recognize problematic / fishy code and flags it for a manual review. With enough red flags or spam reports, your site will likely get a manual review. At least that's how I've puzzled it together :) - that's how I would do it :D.

 

There's just no way to do fully automatic detection of hidden text; it's impossible. Sometimes you can't even do it manually without really searching. Simple things like style sheets for different media (screen, print, handheld, etc.) will almost always result in several elements being invisible.

 

It's strange ... if you overdo it with hidden text, you'll probably be tagged for stuffing keywords into your text anyway. And if you try to "get it right" you might as well just put them into your visible part of the page. Why hide it if it'll get caught anyway - or if you could have them visible?

 

I wonder who's going to be the first to cloak stylesheets to Google :huh:.

 

It could also be something simple -- like trying to check the stylesheet for compatibility with the cache-display page (but I doubt it).

 

Is your stylesheet blocked by the robots.txt?

 

John

Is your stylesheet blocked by the robots.txt?

Nope. That's the most common question. The directory where x.css resides is not blocked by robots.txt (or by anything else, for that matter).
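For anyone who wants to double-check their own setup, Python's standard library can answer the robots.txt question directly. The robots.txt content below is hypothetical, purely for illustration - it is not eKstreme.com's actual file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# /global/x.css is not under a Disallow rule, so a compliant Googlebot may fetch it.
print(rp.can_fetch("Googlebot", "/global/x.css"))
print(rp.can_fetch("Googlebot", "/private/notes.css"))
```

In practice you would point `rp.set_url()` at the live robots.txt and call `rp.read()` instead of parsing a string.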

 

I'll do JS files later tonight when I get home.

 

Pierre


How much do you depend on ekstreme.com's Google traffic? :naughty:

 

How about cloaking them a "really bad" CSS file? Or a stylesheet which contains classes not used which are all hidden?

 

John


How much do you depend on ekstreme.com's Google traffic? :naughty:

 

How about cloaking them a "really bad" CSS file? Or a stylesheet which contains classes not used which are all hidden?

 

Ummm, no thanks. Besides, it was just the one request two months ago. Would they even notice?


One request two months ago sounds more like a glitch, though. Maybe someone had a link to your CSS file? If they were checking CSS files for presumed hidden text, they would be doing it regularly. And if they just needed to build a test database, they could have done that without using a "Googlebot" IP/user-agent.

 

I once noticed something neater (imho): An IP from Google grabs the page, an IP from some open proxy server grabs the javascript and css files. I could see that they belonged to each other. Scary!! Then I figured out it was a Google Web Accelerator user :D.

 

John


oooh, time to check the logs. What about javascript files?

 

Just checked, and sure enough, there were requests for external JS files - a total of 71 requests in the two-month period between September and November.

 

I updated the blog post.

 

Anyone else checking their logs?

 

Pierre


Just spotted Googlebot making requests for external JS files too, which I wish it wouldn't do, as it skews my humans-only stats thingy! :) No idea how many they've requested, as I just happened to notice it on a site that was getting a lot of bot traffic. Can't imagine what exactly they are up to, as trying to understand the JavaScript would be a massive operation.

 

Then again, if Googlebot ever understood JavaScript, that could be good for things like AJAX.

 

Trev


The Googlebot has been checking my CSS and JS files as well.

 

Funny, I did not know that the Googlebot came from Turkey. Learn something new every day. (*)

 

I only checked December, I'll check the later months when I get the log files off of the server....

 

John

 

 

(*) ok, ok, it was probably just a user browsing as a Googlebot :).


Some fake Gbots I spotted were from India and Italy.

 

Glad we got more confirmation!

 

Pierre


Lesson to self: running grep on the server is much faster than downloading the full logs :).

 

Here's my CSS log from oy-oy.eu:

 

2006-10-24	22:23:17	66.249.65.17	GET	/css.css	-	200
2006-10-24	22:24:34	66.249.65.17	GET	/google/aoldb/this.css	-	200

 

(I'm fairly certain that the time is in UTC)

 

Trivia: The Googlebot's user agent was broken for a day:

 

2006-09-07	06:05:48	66.249.65.237 'Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)'
2006-09-07	06:32:58	66.249.65.237 'Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)'
ex0609.log:2006-09-08	17:42:20	66.249.65.237 'Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)'

 

(Note the apostrophe ('))

 

Here comes the killer - javascript files ... a total of 131 accesses to javascript files. Here's an extract:

 

2006-06-29	22:59:18	66.249.72.244	GET	/google/pages/ui.js
2006-08-14	07:30:30	66.249.65.173	GET	/google/searchpr/ui.js
2006-08-14	07:31:42	66.249.65.173	GET	/google/cache/ui.js
2006-08-14	07:34:04	66.249.65.173	GET	/page/jscode/jscoder.js
2006-09-07	06:05:48	66.249.65.237	GET	/google/singlepr/ui.js
2006-09-07	06:32:58	66.249.65.237	GET	/google/singlepr/ui.js
2006-09-08	17:42:20	66.249.65.237	GET	/google/pr/ui.js
2006-09-13	16:05:06	66.249.65.163	GET	/page/jscode/jscoder.js
2006-09-14	22:21:02	66.249.65.163	GET	/google/pr/ui.js
2006-09-15	18:38:43	66.249.65.163	GET	/page/jscode/jscoder.js
2006-09-15	18:38:43	66.249.65.163	GET	/page/jscode/jscoder.js
2006-09-15	19:03:54	66.249.65.163	GET	/google/pr/ui.js
2006-09-15	19:24:06	66.249.65.163	GET	/google/pages/ui.js
2006-09-21	22:13:41	66.249.66.141	GET	/us/feedbacks/showask3.js
2006-09-27	18:48:17	66.249.66.134	GET	/us/feedbacks/showask3.js
2006-10-04	02:37:48	66.249.72.140	GET	/google/pr/ui.js
2006-10-04	13:41:36	66.249.72.140	GET	/google/aoldb/show.js
2006-10-06	01:19:25	66.249.72.140	GET	/page/jscode/jscoder.js
2006-10-06	01:24:18	66.249.72.140	GET	/google/aoldb/show.js
2006-10-06	01:24:24	66.249.72.140	GET	/us/feedbacks/showask3.js
2006-10-06	18:15:03	66.249.72.140	GET	/google/cache/ui.js
2006-10-06	18:15:09	66.249.72.140	GET	/page/jscode/jscoder.js
2006-10-06	18:15:09	66.249.72.140	GET	/us/feedbacks/showask3.js
2006-10-14	00:38:08	66.249.72.236	GET	/us/feedbacks/showask3.js
2006-10-16	16:33:53	66.249.72.236	GET	/google/searchpr/ui.js
2006-10-16	16:34:20	66.249.72.236	GET	/google/pages/ui.js
2006-10-16	16:35:03	66.249.72.236	GET	/us/feedbacks/showask3.js
2006-10-17	16:39:52	66.249.65.17	GET	/google/supplemental/ui.js
2006-10-19	20:35:41	66.249.65.17	GET	/google/pr/ui.js
2006-11-02	21:47:23	66.249.65.20	GET	/google/aoldb/show.js
2006-11-02	22:44:37	66.249.65.20	GET	/us/feedbacks/showask3.js
2006-11-02	22:44:41	66.249.65.20	GET	/google/supplemental/ui.js
2006-11-03	22:40:03	66.249.66.4	GET	/google/cache/ui.js
2006-11-07	20:03:13	66.249.66.4	GET	/shared/jscombo.js
2006-11-07	21:12:11	66.249.66.4	GET	/google/pages/ui.js
2006-11-16	22:20:28	66.249.66.228	GET	/page/jscode/jscoder.js
2006-11-21	07:19:01	66.249.65.147	GET	/google/aoldb/show.js
2006-11-22	18:25:38	66.249.65.147	GET	/google/aoldb/show.js
2006-11-22	19:53:19	66.249.65.147	GET	/google/aoldb/show.js
2006-11-27	20:31:20	66.249.65.147	GET	/shared/jscombo.js
2006-11-30	22:28:07	66.249.65.51	GET	/google/pr/ui.js
2006-12-05	04:28:12	66.249.66.18	GET	/shared/jscombo.js

 

Those javascript files stopped being crawled around the beginning of December, but it crawled new, previously unknown javascript files after that (dynamically generated javascript files with querystrings).

 

Sooo.... I would almost bet that the CSS crawls were accidental, but the javascript files have a fairly long history (oy-oy.eu went live sometime in June/July), so they are probably crawled on purpose. I don't understand why it stopped crawling them in December - perhaps it switched to the dynamic javascript files (I just noticed that the script returns a content-type of "text/html", which makes them fair game for crawlers, I assume).

 

Perhaps.... it has more to do with searching for links in javascript strings? I do have some javascript files that contain broken up links and I have seen more or less all crawlers try to access those URL snippets a few times.

 

John


I remember reading about this last year too - oh, wait, it's 2007 now, so that should be the year before last - 2005.


Hold the presses - might have worked it out!

 

I wonder if it's anything to do with Google Code Search? The examples there include JS files, so it would be very logical for Google to spider javascript files.

 

Maybe...?

 

Trev


Clever B)

 

But I don't see any "proof" of that yet.... :)

edit: fixed url

Edited by Wit


Quite a few people mentioned the Google Code theory. I searched like a madman and couldn't find any evidence.

 

I would love to see some though.

 

Pierre

