
Leading Community for Usability, Search Engine Marketing,
Social Networking, Site Planning & Web Site Development, Since 1998



Googlebot Crawling Css Files!


28 replies to this topic

#1 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 04 January 2007 - 08:17 PM

I just found this single hit in my eKstreme.com logs:

66.249.72.52 - - [24/Oct/2006:17:17:35 -0500] "GET /global/x.css HTTP/1.1" 200 8382 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


The requested file /global/x.css is my current CSS file for the site. The requesting IP address really is part of Google. It was only one request since moving to the new host in March.
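For anyone wanting to run the same "is it really Google" check on an IP from their own logs, here is a minimal sketch. It assumes the usual reverse-then-forward DNS approach: the IP should reverse-resolve to a host under googlebot.com or google.com, and that host should resolve back to the same IP. The function name `is_google_ip` and the injectable `reverse`/`forward` parameters are made up for illustration; this is not necessarily how the check was done for the hit above.

```python
import socket

# Domains a genuine Googlebot host is expected to sit under (assumption of this sketch).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_google_ip(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves to a Google host that resolves back to ip."""
    try:
        host = reverse(ip)[0]          # (hostname, aliases, addresses)
    except OSError:                    # covers socket.herror / socket.gaierror
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(host)[2]   # forward lookup guards against spoofed PTR records
    except OSError:
        return False
    return ip in addresses


# Usage (performs real DNS lookups, so results depend on your resolver):
#   is_google_ip("66.249.72.52")
```

The forward lookup matters because anyone who controls reverse DNS for their own address block can make it claim to be a Google host; only the round trip is hard to fake.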

This is news to me. Is this the end of hidden text? Can others check their logs please?

More on my blog.

Pierre

#2 A.N.Onym

A.N.Onym

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 4001 posts
  • Twitter:http://twitter.com/yuraf
  • Facebook:http://www.facebook.com/yura.filimonov

Posted 04 January 2007 - 08:29 PM

Any easy way to identify this from the raw logs? Just searching for .css could be a nasty job, with thousands of requests to wade through :)

#3 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 04 January 2007 - 08:33 PM

It's the best method to make sure you don't miss it :) Easy is relative in this case.
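If wading through the matches by hand is too painful, something like the following might narrow it down. It is a rough sketch that assumes the logs are in the combined format shown in the first post; the regex and the name `googlebot_asset_hits` are made up for illustration, and the user-agent match is a plain substring test, so it will also catch anyone faking the Googlebot string.

```python
import re

# Combined log format, as in the hit quoted in the first post.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_asset_hits(lines, extensions=(".css",)):
    """Yield (ip, time, path) for requests whose user agent mentions Googlebot
    and whose path ends in one of the given extensions."""
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue                      # skip lines in some other format
        path = match.group("path").split("?", 1)[0].lower()
        if "googlebot" in match.group("agent").lower() and path.endswith(extensions):
            yield match.group("ip"), match.group("time"), match.group("path")


# Usage, assuming a log file called access.log (hypothetical name):
#   with open("access.log") as log:
#       for hit in googlebot_asset_hits(log, extensions=(".css", ".js")):
#           print(hit)
```

Passing `extensions=(".css", ".js")` covers JavaScript files in the same pass.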

It's already on Digg!

#4 Adrian

Adrian

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 5773 posts
  • Twitter:tychoanomaly

Posted 05 January 2007 - 04:22 AM

Ah, but even after reading the CSS, how do they work out what is hidden text?
Anything that has display:none or visibility:hidden on it?

Well, there are valid uses of those styles, and there are other ways to hide content: using position:absolute to put text behind other elements, or moving it off to the left or right...

And aren't people still getting away with the same coloured text trick using font tags?

#5 A.N.Onym

A.N.Onym

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 4001 posts
  • Twitter:http://twitter.com/yuraf
  • Facebook:http://www.facebook.com/yura.filimonov

Posted 05 January 2007 - 04:30 AM

Well, at least detecting hidden text will help. After they find it, they can estimate how gross the violation is.

If that's just a "Skip to content" link that someone made invisible, then it will be alright, I guess. If there are a couple of paragraphs of keywords somewhere off the page, then it is another issue.

#6 Adrian

Adrian

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 5773 posts
  • Twitter:tychoanomaly

Posted 05 January 2007 - 05:02 AM

If there are a couple of paragraphs of keywords somewhere off the page


But how will they actually work that out? They can't just read the CSS file, see something like p.hidden{display:none;} and think "ooh, some hidden text".

#7 A.N.Onym

A.N.Onym

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 4001 posts
  • Twitter:http://twitter.com/yuraf
  • Facebook:http://www.facebook.com/yura.filimonov

Posted 05 January 2007 - 05:07 AM

- First, they'll list all the cases or patterns for hiding text (display:none, visibility:hidden, extra-large negative or positive margins, etc).
- Then they detect the selectors, classes and ids, associated with such styles
- Then they find those tags (selectors, classes and ids) on the page.
- Then they look at the size of text, how spammy it looks and such.
- Then they take action.

And that's assuming we are talking about external styles, not internal or inline styles, in which cases some of the steps will be missing.
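To make the first few steps concrete, here is a toy sketch of how such a check could look for an external stylesheet. It is only an illustration under heavy assumptions, not how any search engine is known to work: the pattern list, the regex-based CSS parsing and the names `hiding_selectors` and `hidden_text` are all made up here, it only understands simple class and id selectors, and it ignores CSS comments, media types, the cascade and unclosed tags. The "how spammy does it look" step is left to whoever reads the output.

```python
import re
from html.parser import HTMLParser

# Step 1: patterns that may indicate hidden text (deliberately incomplete).
HIDING_PATTERNS = [
    re.compile(r"display\s*:\s*none", re.I),
    re.compile(r"visibility\s*:\s*hidden", re.I),
    re.compile(r"(?:margin-left|text-indent|left)\s*:\s*-\d{3,}", re.I),
]
RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")   # naive "selector { declarations }"
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def hiding_selectors(css):
    """Step 2: collect ('.', class) and ('#', id) pairs from rules that hide content."""
    flagged = set()
    for selector, body in RULE.findall(css):
        if any(p.search(body) for p in HIDING_PATTERNS):
            flagged.update(re.findall(r"([.#])([\w-]+)", selector))
    return flagged


class HiddenTextFinder(HTMLParser):
    """Step 3: gather text inside elements carrying a flagged class or id."""

    def __init__(self, flagged):
        super().__init__()
        self.flagged = flagged
        self.depth = 0        # > 0 while inside a flagged element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return            # no closing tag, so it must not change the depth
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        hit = any((".", c) in self.flagged for c in classes)
        hit = hit or ("#", attrs.get("id")) in self.flagged
        if self.depth or hit:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def hidden_text(css, html):
    """Return the text a naive check would report as hidden, for later review."""
    finder = HiddenTextFinder(hiding_selectors(css))
    finder.feed(html)
    finder.close()
    return " ".join(finder.chunks)
```

Calling `hidden_text(stylesheet, page)` returns the candidate text; a word count on the result would be one crude input to the "size of text" step.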

Pretty simple to me, really.

Edited by A.N.Onym, 05 January 2007 - 05:08 AM.


#8 Wit

Wit

    Sonic Boom Member

  • 1000 Post Club
  • 1597 posts
  • Twitter:w1t
  • Facebook:mcmdewit

Posted 05 January 2007 - 05:11 AM

The spidering may have been triggered by a "manual" reviewer requesting it.... </ponder ponder>

#9 A.N.Onym

A.N.Onym

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 4001 posts
  • Twitter:http://twitter.com/yuraf
  • Facebook:http://www.facebook.com/yura.filimonov

Posted 05 January 2007 - 05:13 AM

Could it be Matt, wearing a costume of a Googlebot, spidering the favorite websites? I believe he mentioned once that he does crawl the web as the Googlebot.

#10 Adrian

Adrian

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 5773 posts
  • Twitter:tychoanomaly

Posted 05 January 2007 - 05:27 AM

- Then they look at the size of text, how spammy it looks and such.


That's where it's very open to interpretation though....

#11 Wit

Wit

    Sonic Boom Member

  • 1000 Post Club
  • 1597 posts
  • Twitter:w1t
  • Facebook:mcmdewit

Posted 05 January 2007 - 05:45 AM

Yup. Manual review only.

#12 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 05 January 2007 - 08:10 AM

I've seen people from Google visit the site before, but never as Googlebot and never requesting a CSS file without a referer. They tend to use FF and the HTTP REFERER is set to the page they're viewing. They usually come from a Google search (shock, horror).

Unless of course, the manual review fetches the pages as Googlebot...

This raises the question of why they wanted to manually review ekstreme.com. I don't think it's that "special"!

Pierre

#13 Wit

Wit

    Sonic Boom Member

  • 1000 Post Club
  • 1597 posts
  • Twitter:w1t
  • Facebook:mcmdewit

Posted 05 January 2007 - 08:14 AM

Dude, it has TOOLS :D

#14 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 05 January 2007 - 08:21 AM

My comment was that a manual review would be done through a normal browser which leaves a different trail than using Googlebot.

Besides, the requests from the IP address around the time are random pages from all over the site. It doesn't look systematic to me.

Pierre

#15 Wit

Wit

    Sonic Boom Member

  • 1000 Post Club
  • 1597 posts
  • Twitter:w1t
  • Facebook:mcmdewit

Posted 05 January 2007 - 08:43 AM

Ummm, I'm sorry I wasn't more clear.

I don't think that this log hit was produced by a manual review. Heh, I like to look at my raw logs from time to time and I know the difference between a bot hit and a human one.

My suggestion/speculation is that the SEs will not spider .css files until they are triggered by some human request to do so.

I'm also quite sure that if your site is being reviewed by humans, you won't necessarily notice that in your logs. If I were a SE engineer, I'd make a local copy and dissect that first and look at the dynamic server-side stuff later. If only not to alert the webmaster being scrutinised.......

#16 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 05 January 2007 - 08:57 AM

oooh, time to check the logs. What about javascript files?

It would be just a matter of time until they actually do this. I imagine it works much like their other systems in that it tries to recognize problematic / fishy code and flags it for a manual review. With enough red flags or spam reports, your site will likely get a manual review. At least that's how I've puzzled it together :) - that's how I would do it :D.

There's just no way to do reliable automatic detection of hidden text; it's impossible. Sometimes you can't even do it manually without really searching. Simple things like style sheets for different uses (screen, print, handheld, etc) will almost always result in several elements being invisible.

It's strange ... if you overdo it with hidden text, you'll probably be tagged for stuffing keywords into your text anyway. And if you try to "get it right" you might as well just put them into your visible part of the page. Why hide it if it'll get caught anyway - or if you could have them visible?

I wonder who's going to be the first to cloak stylesheets to Google :huh:.

It could also be something simple -- like trying to check the stylesheet for compatibility with the cache-display page (but I doubt it).

Is your stylesheet blocked by the robots.txt?

John

#17 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 05 January 2007 - 09:10 AM

Is your stylesheet blocked by the robots.txt?

Nope. That's the most common question. The directory where x.css resides is not blocked by robots.txt (or anything else, for that matter).
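For anyone wanting to double-check the same thing on their own site, a quick sketch using Python's standard `urllib.robotparser` follows. The robots.txt content here is a made-up example, not the real ekstreme.com file, and the helper name `allowed_for` is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed_for(robots_txt, agent, path):
    """Return True if the given robots.txt text lets `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())   # parse text directly, no network access
    return parser.can_fetch(agent, path)


# Hypothetical robots.txt, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(allowed_for(EXAMPLE_ROBOTS, "Googlebot", "/global/x.css"))
```

To test a live file instead, `RobotFileParser` also offers `set_url()` and `read()`, which fetch the robots.txt over HTTP.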

I'll do JS files later tonight when I get home.

Pierre

#18 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 05 January 2007 - 09:18 AM

How much do you depend on ekstreme.com's Google traffic? :naughty:

How about cloaking them a "really bad" CSS file? Or a stylesheet which contains classes not used which are all hidden?

John

#19 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 05 January 2007 - 09:25 AM

How much do you depend on ekstreme.com's Google traffic? :naughty:

How about cloaking them a "really bad" CSS file? Or a stylesheet which contains classes not used which are all hidden?

Ummm, no thanks. Besides, it was just the one request two months ago. Would they even notice?

#20 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 05 January 2007 - 09:32 AM

One request 2 months ago sounds more like a glitch however. Maybe someone had a link to your CSS file? If they were checking CSS files for presumed hidden text, they would be doing it regularly. If they needed to build a database to test with, they could have done it without using a "Googlebot" IP/user agent.

I once noticed something neater (imho): An IP from Google grabs the page, an IP from some open proxy server grabs the javascript and css files. I could see that they belonged to each other. Scary!! Then I figured out it was a Google Web Accelerator user :D.

John


