Cre8asiteforums Internet Marketing
and Conversion Web Design


Googlebot Crawling Css Files!


28 replies to this topic

#1 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 04 January 2007 - 08:17 PM

I just found this single hit in my eKstreme.com logs:

66.249.72.52 - - [24/Oct/2006:17:17:35 -0500] "GET /global/x.css HTTP/1.1" 200 8382 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


The requested file /global/x.css is my current CSS file for the site. The requesting IP address really is part of Google. It was only one request since moving to the new host in March.

This is news to me. Is this the end of hidden text? Can others check their logs please?

More on my blog.

Pierre

#2 A.N.Onym

A.N.Onym

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 4003 posts

Posted 04 January 2007 - 08:29 PM

Any easy way to identify this from the raw logs? Just searching for .css might be a nasty job to search through thousands of requests :)

#3 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 04 January 2007 - 08:33 PM

It's the best method to make sure you don't miss it :) Easy is relative in this case.
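For anyone who would rather not eyeball thousands of lines, here is a minimal sketch of a filter, assuming the log is in the Apache "combined" format shown in the first post. The function name `googlebot_asset_hits` and the regular expression are illustrative, not part of any tool mentioned in this thread.

```python
import re

# Apache "combined" log format, as in the log line quoted in the first post.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_asset_hits(lines, extensions=(".css", ".js")):
    """Yield (ip, time, path) for requests whose user agent claims to be
    Googlebot and whose path ends in one of the given extensions."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # not a combined-format line, skip it
        if "Googlebot" not in match.group("agent"):
            continue
        # Ignore any query string when checking the file extension.
        path = match.group("path").split("?", 1)[0]
        if path.lower().endswith(extensions):
            yield match.group("ip"), match.group("time"), match.group("path")
```

Passing an open log file to the function (files iterate line by line) gives one tuple per matching hit. The user agent string can be faked, so a match only says the request claimed to be Googlebot.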

It's already on Digg!

#4 Adrian

Adrian

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 5779 posts

Posted 05 January 2007 - 04:22 AM

Ah, but even reading the CSS, how do they work out what is hidden text?
Anything that has display:none, or visibility hidden on it?

Well there are valid uses of those styles, and there are other ways to hide content. Using position:absolute to put text behind other elements, or moving it off to the left or right....

And aren't people still getting away with the same coloured text trick using font tags?

#5 A.N.Onym

A.N.Onym

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 4003 posts

Posted 05 January 2007 - 04:30 AM

Well, at least detecting hidden text will help. After they find it, they can estimate how gross the violation is.

If that's just a "Skip to content" link that someone made invisible, then it will be alright, I guess. If there are a couple of paragraphs of keywords somewhere off the page, then it is another issue.

#6 Adrian

Adrian

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 5779 posts

Posted 05 January 2007 - 05:02 AM

If there are a couple of paragraphs of keywords somewhere off the page


But how will they actually work that out? They can't just read the CSS file and see something like p.hidden{display:none;} and think "ooh, some hidden text".

#7 A.N.Onym

A.N.Onym

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 4003 posts

Posted 05 January 2007 - 05:07 AM

- First, they'll list all the cases or patterns for hiding text (display none, visibility hidden, unusually large negative or positive margins, etc).
- Then they detect the selectors, classes and ids, associated with such styles
- Then they find those tags (selectors, classes and ids) on the page.
- Then they look at the size of text, how spammy it looks and such.
- Then they take action.
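
The first three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the pattern list, the function names `hiding_selectors` and `hidden_classes_used`, and the regex-based parsing are all made up for the example, and nothing here reflects what Google actually does. It also ignores media types, ids, inline styles and legitimate uses of these properties.

```python
import re

# Step 1: declarations commonly used to hide content.
# This list is an illustrative assumption, not a known Google rule set.
HIDING_PATTERNS = [
    re.compile(r"display\s*:\s*none", re.I),
    re.compile(r"visibility\s*:\s*hidden", re.I),
    # Large negative offsets, e.g. text-indent: -9999px or left: -5000px
    re.compile(r"(?:text-indent|margin-left|left|top)\s*:\s*-\d{3,}(?:px|em)?", re.I),
]

# Naive "selectors { declarations }" matcher; no nesting, no comments.
RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")
CLASS_ATTR = re.compile(r'class\s*=\s*"([^"]*)"', re.I)


def hiding_selectors(css):
    """Step 2: return the selectors whose rule body matches a hiding pattern."""
    found = []
    for selectors, body in RULE.findall(css):
        if any(pattern.search(body) for pattern in HIDING_PATTERNS):
            found.extend(s.strip() for s in selectors.split(","))
    return found


def hidden_classes_used(css, html):
    """Step 3: return the hidden class names that actually appear in the HTML."""
    hidden = set()
    for selector in hiding_selectors(css):
        hidden.update(re.findall(r"\.([\w-]+)", selector))
    used = set()
    for attr in CLASS_ATTR.findall(html):
        used.update(attr.split())
    return sorted(hidden & used)
```

Judging how spammy the matched text looks is the part no regex will do for you.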

And that's assuming we are talking about external styles, not internal or inline styles, in which cases some of the steps will be missing.

Pretty simple to me, really.

Edited by A.N.Onym, 05 January 2007 - 05:08 AM.


#8 Wit

Wit

    Sonic Boom Member

  • 1000 Post Club
  • 1599 posts

Posted 05 January 2007 - 05:11 AM

The spidering may have been triggered by a "manual" reviewer requesting it.... </ponder ponder>

#9 A.N.Onym

A.N.Onym

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 4003 posts

Posted 05 January 2007 - 05:13 AM

Could it be Matt, wearing a costume of a Googlebot, spidering the favorite websites? I believe he mentioned once that he does crawl the web as the Googlebot.

#10 Adrian

Adrian

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 5779 posts

Posted 05 January 2007 - 05:27 AM

- Then they look at the size of text, how spammy it looks and such.


That's where it's very open to interpretation though....

#11 Wit

Wit

    Sonic Boom Member

  • 1000 Post Club
  • 1599 posts

Posted 05 January 2007 - 05:45 AM

Yup. Manual review only.

#12 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 05 January 2007 - 08:10 AM

I've seen people from Google visit the site before, but never as Googlebot and never requesting a CSS file without a referer. They tend to use FF and the HTTP REFERER is set to the page they're viewing. They usually come from a Google search (shock, horror).

Unless of course, the manual review fetches the pages as Googlebot...

This raises the question of why they wanted to manually review ekstreme.com. I don't think it's that "special"!

Pierre

#13 Wit

Wit

    Sonic Boom Member

  • 1000 Post Club
  • 1599 posts

Posted 05 January 2007 - 08:14 AM

Dude, it has TOOLS :D

#14 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 05 January 2007 - 08:21 AM

My comment was that a manual review would be done through a normal browser which leaves a different trail than using Googlebot.

Besides, the requests from the IP address around the time are random pages from all over the site. It doesn't look systematic to me.

Pierre

#15 Wit

Wit

    Sonic Boom Member

  • 1000 Post Club
  • 1599 posts

Posted 05 January 2007 - 08:43 AM

Ummm, I'm sorry I wasn't more clear.

I don't think that this log hit was produced by a manual review. Heh, I like to look at my raw logs from time to time and I know the difference between a bot hit and a human one.

My suggestion/speculation is that the SEs will not spider .css files until they are triggered by some human request to do so.

I'm also quite sure that if your site is being reviewed by humans, you won't necessarily notice that in your logs. If I were a SE engineer, I'd make a local copy and dissect that first and look at the dynamic server-side stuff later. If only not to alert the webmaster being scrutinised.......

#16 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 05 January 2007 - 08:57 AM

oooh, time to check the logs. What about javascript files?

It would be just a matter of time until they actually do this. I imagine it works much like their other systems in that it tries to recognize problematic / fishy code and flags it for a manual review. With enough red flags or spam reports, your site will likely get a manual review. At least that's how I've puzzled it together :) - that's how I would do it :D.

There's just no way to do an automatic detection of hidden text, it's impossible. Sometimes you can't even do it manually without really searching. Simple things like style sheets for different uses (screen, print, handheld, etc) will almost always result in several elements being invisible.

It's strange ... if you overdo it with hidden text, you'll probably be tagged for stuffing keywords into your text anyway. And if you try to "get it right" you might as well just put them into your visible part of the page. Why hide it if it'll get caught anyway - or if you could have them visible?

I wonder who's going to be the first to cloak stylesheets to Google :huh:.

It could also be something simple -- like trying to check the stylesheet for compatibility with the cache-display page (but I doubt it).

Is your stylesheet blocked by the robots.txt?

John

#17 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 05 January 2007 - 09:10 AM

Is your stylesheet blocked by the robots.txt?

Nope. That's the most common question. The directory where x.css resides is not blocked by robots.txt (or anything for that matter).
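
For anyone who wants to check their own robots.txt the same way, here is a small sketch using Python's standard `urllib.robotparser`. The helper name `is_blocked` and the example robots.txt rules in the usage note are hypothetical, not the real ekstreme.com file.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text disallows `url` for `user_agent`."""
    parser = RobotFileParser()
    # parse() takes the file contents as a list of lines, so no network
    # access is needed when the text is already at hand.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)
```

With a robots.txt containing only `User-agent: *` and `Disallow: /private/`, a path such as /global/x.css comes back as not blocked, while anything under /private/ comes back as blocked.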

I'll do JS files later tonight when I get home.

Pierre

#18 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 05 January 2007 - 09:18 AM

How much do you depend on ekstreme.com's Google traffic? :naughty:

How about cloaking them a "really bad" CSS file? Or a stylesheet which contains classes not used which are all hidden?

John

#19 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 05 January 2007 - 09:25 AM

How much do you depend on ekstreme.com's Google traffic? :naughty:

How about cloaking them a "really bad" CSS file? Or a stylesheet which contains classes not used which are all hidden?

Ummm, no thanks. Besides, it was just the one request two months ago. Would they even notice?

#20 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 05 January 2007 - 09:32 AM

One request 2 months ago sounds more like a glitch however. Maybe someone had a link to your CSS file? If they were checking CSS files for presumed hidden text, they would be doing it regularly. If they needed to build a database to test with, they could have done it without using a "Googlebot" IP/user agent.

I once noticed something neater (imho): An IP from Google grabs the page, an IP from some open proxy server grabs the javascript and css files. I could see that they belonged to each other. Scary!! Then I figured out it was a Google Web Accelerator user :D.

John

#21 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 05 January 2007 - 02:35 PM

oooh, time to check the logs. What about javascript files?

Just checked and sure enough, there were requests of external JS files. I found a total of 71 requests in the two-month period between September and November.

I updated the blog post.

Anyone else checking their logs?

Pierre

#22 trevHCS

trevHCS

    Light Speed Member

  • Members
  • 662 posts

Posted 08 January 2007 - 09:41 AM

Just spotted Googlebot requesting external JS files too, which I wish they wouldn't do as it skews my human-only stats thingy! :) No idea how many they've requested, as I just happened to notice it on a site that was getting a lot of bot traffic. Can't imagine what exactly they are up to, as trying to understand the j/script will be a massive operation.

Then again, if Googlebot ever understood j/script, that could be good for things like AJAX.

Trev

#23 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 08 January 2007 - 09:50 AM

The Googlebot has been checking my CSS and JS files as well.

Funny, I did not know that the Googlebot came from Turkey. Learn something new every day. (*)

I only checked December, I'll check the later months when I get the log files off of the server....

John


(*) ok, ok, it was probably just a user browsing as a Googlebot :).

#24 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 08 January 2007 - 10:02 AM

Some fake Gbots I spotted were from India and Italy.
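
A user agent string proves nothing, so the usual way to weed out fake Gbots is a reverse DNS lookup on the IP followed by a forward lookup on the resulting hostname, which is the check Google itself recommends. Below is a minimal sketch of that idea. The function name `is_real_googlebot` and the injectable `reverse`/`forward` parameters are assumptions made so the logic can be tried without network access.

```python
import socket

# Hostname suffixes expected for genuine Googlebot reverse DNS records.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_real_googlebot(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False  # no reverse DNS record for this address
    if not hostname.lower().endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False  # hostname does not resolve back
    # The forward lookup must return the original IP, otherwise the
    # reverse record could have been set up by anyone.
    return ip in addresses
```

Called with just an IP address it uses the real resolver, so it is best run on a short list of suspicious addresses rather than on every log line.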

Glad we got more confirmation!

Pierre

#25 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 08 January 2007 - 10:35 AM

Lesson to self: running grep on the server is much faster than downloading the full logs :).

Here's my CSS log from oy-oy.eu:
2006-10-24	22:23:17	66.249.65.17	GET	/css.css	-	200
2006-10-24	22:24:34	66.249.65.17	GET	/google/aoldb/this.css	-	200
(I'm fairly certain that the time is in UTC)

Trivia: The Googlebot's user agent was broken for a day:
2006-09-07	06:05:48	66.249.65.237 'Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)'
2006-09-07	06:32:58	66.249.65.237 'Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)'
2006-09-08	17:42:20	66.249.65.237 'Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)'
(Note the apostrophe ('))

Here comes the killer - javascript files ... total of 131 accesses to javascript files, here's an extract:
2006-06-29	22:59:18	66.249.72.244	GET	/google/pages/ui.js
2006-08-14	07:30:30	66.249.65.173	GET	/google/searchpr/ui.js
2006-08-14	07:31:42	66.249.65.173	GET	/google/cache/ui.js
2006-08-14	07:34:04	66.249.65.173	GET	/page/jscode/jscoder.js
2006-09-07	06:05:48	66.249.65.237	GET	/google/singlepr/ui.js
2006-09-07	06:32:58	66.249.65.237	GET	/google/singlepr/ui.js
2006-09-08	17:42:20	66.249.65.237	GET	/google/pr/ui.js
2006-09-13	16:05:06	66.249.65.163	GET	/page/jscode/jscoder.js
2006-09-14	22:21:02	66.249.65.163	GET	/google/pr/ui.js
2006-09-15	18:38:43	66.249.65.163	GET	/page/jscode/jscoder.js
2006-09-15	18:38:43	66.249.65.163	GET	/page/jscode/jscoder.js
2006-09-15	19:03:54	66.249.65.163	GET	/google/pr/ui.js
2006-09-15	19:24:06	66.249.65.163	GET	/google/pages/ui.js
2006-09-21	22:13:41	66.249.66.141	GET	/us/feedbacks/showask3.js
2006-09-27	18:48:17	66.249.66.134	GET	/us/feedbacks/showask3.js
2006-10-04	02:37:48	66.249.72.140	GET	/google/pr/ui.js
2006-10-04	13:41:36	66.249.72.140	GET	/google/aoldb/show.js
2006-10-06	01:19:25	66.249.72.140	GET	/page/jscode/jscoder.js
2006-10-06	01:24:18	66.249.72.140	GET	/google/aoldb/show.js
2006-10-06	01:24:24	66.249.72.140	GET	/us/feedbacks/showask3.js
2006-10-06	18:15:03	66.249.72.140	GET	/google/cache/ui.js
2006-10-06	18:15:09	66.249.72.140	GET	/page/jscode/jscoder.js
2006-10-06	18:15:09	66.249.72.140	GET	/us/feedbacks/showask3.js
2006-10-14	00:38:08	66.249.72.236	GET	/us/feedbacks/showask3.js
2006-10-16	16:33:53	66.249.72.236	GET	/google/searchpr/ui.js
2006-10-16	16:34:20	66.249.72.236	GET	/google/pages/ui.js
2006-10-16	16:35:03	66.249.72.236	GET	/us/feedbacks/showask3.js
2006-10-17	16:39:52	66.249.65.17	GET	/google/supplemental/ui.js
2006-10-19	20:35:41	66.249.65.17	GET	/google/pr/ui.js
2006-11-02	21:47:23	66.249.65.20	GET	/google/aoldb/show.js
2006-11-02	22:44:37	66.249.65.20	GET	/us/feedbacks/showask3.js
2006-11-02	22:44:41	66.249.65.20	GET	/google/supplemental/ui.js
2006-11-03	22:40:03	66.249.66.4	GET	/google/cache/ui.js
2006-11-07	20:03:13	66.249.66.4	GET	/shared/jscombo.js
2006-11-07	21:12:11	66.249.66.4	GET	/google/pages/ui.js
2006-11-16	22:20:28	66.249.66.228	GET	/page/jscode/jscoder.js
2006-11-21	07:19:01	66.249.65.147	GET	/google/aoldb/show.js
2006-11-22	18:25:38	66.249.65.147	GET	/google/aoldb/show.js
2006-11-22	19:53:19	66.249.65.147	GET	/google/aoldb/show.js
2006-11-27	20:31:20	66.249.65.147	GET	/shared/jscombo.js
2006-11-30	22:28:07	66.249.65.51	GET	/google/pr/ui.js
2006-12-05	04:28:12	66.249.66.18	GET	/shared/jscombo.js
Those javascript files stopped being crawled around beginning of December, but it crawled new, previously unknown javascript files after that (dynamically generated javascript files with querystrings).
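
To see which files get fetched most often, an extract like the one above can be tallied per path. This is a small sketch assuming whitespace-separated fields in the order date, time, IP, method, path, as in the extract. The function name `count_paths` is made up for the example.

```python
from collections import Counter


def count_paths(lines):
    """Count GET requests per path in date/time/ip/method/path log lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()  # works for tabs as well as spaces
        if len(fields) >= 5 and fields[3] == "GET":
            counts[fields[4]] += 1
    return counts
```

Calling `most_common()` on the returned counter lists the paths from most to least requested.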

Sooo.... I would almost bet that the CSS files being crawled were accidental, but the javascript files have a fairly long history (oy-oy.eu went live sometime June/July) so they are probably crawled on purpose. I don't understand why it stopped crawling them in December; perhaps it has to do with the dynamic javascript files (I just noticed that the script returns a content-type of "text/html", which makes them fair game for crawlers, I assume).

Perhaps.... it has more to do with searching for links in javascript strings? I do have some javascript files that contain broken up links and I have seen more or less all crawlers try to access those URL snippets a few times.
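
To illustrate how broken-up links could end up being requested, here is a rough sketch of a crawler pulling URL-like strings out of javascript string literals. It is pure speculation about crawler behaviour: the function name `url_like_strings` and both regular expressions are assumptions for the example, and real javascript parsing is considerably messier.

```python
import re

# A single- or double-quoted string literal on one line, allowing escapes.
STRING_LITERAL = re.compile(r"""(["'])((?:\\.|(?!\1)[^\\\n])*)\1""")

# Anything that looks like an absolute URL, a root-relative path,
# or a file name with a common web extension.
URL_LIKE = re.compile(r"^(?:https?://\S+|/\S+|\S+\.(?:html?|php|js|css))$", re.I)


def url_like_strings(js_source):
    """Return the string literals in the source that look like URLs or paths."""
    return [
        match.group(2)
        for match in STRING_LITERAL.finditer(js_source)
        if URL_LIKE.match(match.group(2))
    ]
```

A link assembled from pieces such as '/google/pr/' and "ui.js" yields two separate candidates, each of which a naive crawler might try to fetch on its own.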

John

#26 whitemark

whitemark

    Time Traveler Member

  • 1000 Post Club
  • 1071 posts

Posted 08 January 2007 - 10:56 AM

I remember reading about this last year too - oh, wait, it's 2007 now, so that should be the year before last - 2005.

#27 trevHCS

trevHCS

    Light Speed Member

  • Members
  • 662 posts

Posted 10 January 2007 - 03:43 PM

Hold the presses - might have worked it out!

I wonder if it's anything to do with Google Code Search? On the examples there they have JS files so it would be very logical for Google to spider javascript files.

Maybe...?

Trev

#28 Wit

Wit

    Sonic Boom Member

  • 1000 Post Club
  • 1599 posts

Posted 10 January 2007 - 03:57 PM

Clever B)

But I don't see any "proof" of that yet.... :)





edit: fixed url

Edited by Wit, 10 January 2007 - 03:58 PM.


#29 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 10 January 2007 - 04:06 PM

Quite a few people mentioned the Google Code theory. I searched like a madman and couldn't find any evidence.

I would love to see some though.

Pierre


