Googlebot Crawling CSS Files!

Membership Admin & Moderator

Group: Membership Admin & Moderator
Joined: 30-September 05
Posts: 3,326
From: Some round-ish rock floating in a vacuum.
post Jan 4 2007, 08:17 PM
I just found this single hit in my eKstreme.com logs:

QUOTE

66.249.72.52 - - [24/Oct/2006:17:17:35 -0500] "GET /global/x.css HTTP/1.1" 200 8382 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


The requested file /global/x.css is my current CSS file for the site. The requesting IP address really is part of Google. It was only one request since moving to the new host in March.
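
For anyone who wants to run the same kind of IP check on their own logs: the usual approach is a reverse DNS lookup on the IP, a check that the host name sits under a Google domain, and then a forward lookup to confirm the name maps back to the same IP. Below is a minimal sketch of that idea, not necessarily the check used for the hit above. The function name and the injectable resolver parameters are made up for illustration, and the suffix list is an assumption.

```python
import socket

# Assumed list of domains Google's crawler host names end with.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve ip, check the host name, then forward-confirm it."""
    # The defaults query real DNS; both resolvers are injectable so the
    # logic can be exercised without network access.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
            return False
        # The forward lookup must return the IP we started from,
        # otherwise the reverse record could simply be forged.
        return ip in forward_lookup(host)
    except OSError:
        # Covers socket.herror and socket.gaierror (lookup failures).
        return False
```

Calling `is_verified_googlebot("66.249.72.52")` with the default resolvers does two live DNS queries, so the answer depends on what DNS says at the time you run it.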

This is news to me. Is this the end of hidden text? Can others check their logs please?

More on my blog.

Pierre

Star Member

Group: Moderators
Joined: 29-December 05
Posts: 3,506
From: Novosibirsk, Russia
post Jan 4 2007, 08:29 PM
Any easy way to identify this from the raw logs? Just searching for .css could mean wading through thousands of requests :)

Membership Admin & Moderator

Group: Membership Admin & Moderator
Joined: 30-September 05
Posts: 3,326
From: Some round-ish rock floating in a vacuum.
post Jan 4 2007, 08:33 PM
It's the best way to make sure you don't miss it :) "Easy" is relative in this case.

It's already on Digg!
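
To make the log search a bit less painful, here is a rough sketch that filters on both the user agent and the file extension, so you only see Googlebot requests for stylesheets. It assumes the logs are in the combined log format shown in the quoted hit; the function name and the `extensions` parameter are invented for this example.

```python
import re

# Combined log format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_asset_hits(lines, extensions=(".css",)):
    """Return (time, ip, path) for Googlebot requests to matching files."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        # Ignore any query string when checking the extension.
        path = m.group("path").split("?", 1)[0].lower()
        is_bot = "googlebot" in m.group("agent").lower()
        if is_bot and path.endswith(tuple(extensions)):
            hits.append((m.group("time"), m.group("ip"), m.group("path")))
    return hits
```

Something like `googlebot_asset_hits(open("access.log"))` would scan a whole file (the file name is only a placeholder). The user agent string can be faked, so any hit still needs its IP checked separately.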

Moderator

Group: Moderators
Joined: 29-August 02
Posts: 5,751
From: Bristol, UK
post Jan 5 2007, 04:22 AM
Ah, but even reading the CSS, how do they work out what is hidden text?
Anything that has display:none or visibility:hidden on it?

Well there are valid uses of those styles, and there are other ways to hide content. Using position:absolute to put text behind other elements, or moving it off to the left or right....

And aren't people still getting away with the same coloured text trick using font tags?

Star Member

Group: Moderators
Joined: 29-December 05
Posts: 3,506
From: Novosibirsk, Russia
post Jan 5 2007, 04:30 AM
Well, at least detecting hidden text will help. After they find it, they can estimate how gross the violation is.

If that's just a "Skip to content" link that someone made invisible, then it will be alright, I guess. If there are a couple of paragraphs of keywords somewhere off the page, then it is another issue.

Moderator

Group: Moderators
Joined: 29-August 02
Posts: 5,751
From: Bristol, UK
post Jan 5 2007, 05:02 AM
QUOTE
If there are a couple of paragraphs of keywords somewhere off the page


But how will they actually work that out? They can't just read the CSS file, see something like p.hidden{display:none;} and think "ooh, some hidden text".

Star Member

Group: Moderators
Joined: 29-December 05
Posts: 3,506
From: Novosibirsk, Russia
post Jan 5 2007, 05:07 AM
- First, they'll list all the cases or patterns for hiding text (display:none, visibility:hidden, very large negative or positive margins, etc).
- Then they detect the selectors, classes and ids, associated with such styles
- Then they find those tags (selectors, classes and ids) on the page.
- Then they look at the size of text, how spammy it looks and such.
- Then they take action.

And that's assuming we are talking about external styles, not internal or inline styles, in which case some of the steps can be skipped.
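
The steps above can be sketched roughly as follows. This is only an illustration under heavy assumptions, not a description of what any search engine actually runs: the pattern list is tiny, the CSS "parser" is a regex that only understands flat `selector { ... }` rules, the HTML is assumed to be reasonably well-formed, and all names are invented for the example.

```python
import re
from html.parser import HTMLParser

# Step 1: declarations treated as "hiding" (illustrative and far from complete).
HIDING = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden"
    r"|(?:text-indent|margin-left|left)\s*:\s*-\d{3,}",
    re.IGNORECASE,
)
RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")   # selector { declarations }
NAME = re.compile(r"([.#])([\w-]+)")          # .class or #id inside a selector
VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no close tag


def hiding_names(css):
    """Step 2: class and id names whose rules contain a hiding declaration."""
    classes, ids = set(), set()
    for selector, body in RULE.findall(css):
        if HIDING.search(body):
            for kind, name in NAME.findall(selector):
                (classes if kind == "." else ids).add(name)
    return classes, ids


class HiddenTextCollector(HTMLParser):
    """Step 3: collect text inside elements carrying those classes or ids."""

    def __init__(self, classes, ids):
        super().__init__()
        self.classes, self.ids = classes, ids
        self.stack = []    # one bool per open element: hidden (or inside hidden)?
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        own = bool(self.classes & set((attrs.get("class") or "").split()))
        own = own or attrs.get("id") in self.ids
        self.stack.append(own or bool(self.stack and self.stack[-1]))

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and self.stack[-1] and data.strip():
            self.chunks.append(data.strip())


def hidden_text_report(css, html):
    """Step 4 input: how much text is hidden, and what it says."""
    collector = HiddenTextCollector(*hiding_names(css))
    collector.feed(html)
    collector.close()
    text = " ".join(collector.chunks)
    return {"hidden_words": len(text.split()), "text": text}
```

Note that the report treats a hidden "Skip to content" link and a hidden block of keywords the same way. Deciding how spammy the text looks, and what action to take, is the part this sketch leaves out.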

Pretty simple to me, really.

This post has been edited by A.N.Onym: Jan 5 2007, 05:08 AM

Star Member

Group: 1000 Post Club
Joined: 18-November 05
Posts: 1,476
From: GMT+1
post Jan 5 2007, 05:11 AM
The spidering may have been triggered by a "manual" reviewer requesting it.... </ponder ponder>

Star Member

Group: Moderators
Joined: 29-December 05
Posts: 3,506
From: Novosibirsk, Russia
post Jan 5 2007, 05:13 AM
Could it be Matt, wearing a Googlebot costume, spidering his favorite websites? I believe he mentioned once that he does crawl the web as the Googlebot.

Moderator

Group: Moderators
Joined: 29-August 02
Posts: 5,751
From: Bristol, UK
post Jan 5 2007, 05:27 AM
QUOTE
- Then they look at the size of text, how spammy it looks and such.


That's where it's very open to interpretation though....

Star Member

Group: 1000 Post Club
Joined: 18-November 05
Posts: 1,476
From: GMT+1
post Jan 5 2007, 05:45 AM
Yup. Manual review only.

Membership Admin & Moderator

Group: Membership Admin & Moderator
Joined: 30-September 05
Posts: 3,326
From: Some round-ish rock floating in a vacuum.
post Jan 5 2007, 08:10 AM
I've seen people from Google visit the site before, but never as Googlebot and never requesting a CSS file without a referer. They tend to use FF and the HTTP REFERER is set to the page they're viewing. They usually come from a Google search (shock, horror).

Unless of course, the manual review fetches the pages as Googlebot...

This raises the question of why they would want to manually review ekstreme.com. I don't think it's that "special"!

Pierre

Star Member

Group: 1000 Post Club
Joined: 18-November 05
Posts: 1,476
From: GMT+1
post Jan 5 2007, 08:14 AM
Dude, it has TOOLS :)

Membership Admin & Moderator

Group: Membership Admin & Moderator
Joined: 30-September 05
Posts: 3,326
From: Some round-ish rock floating in a vacuum.
post Jan 5 2007, 08:21 AM
My comment was that a manual review would be done through a normal browser which leaves a different trail than using Googlebot.

Besides, the requests from the IP address around the time are random pages from all over the site. It doesn't look systematic to me.

Pierre

Star Member

Group: 1000 Post Club
Joined: 18-November 05
Posts: 1,476
From: GMT+1
post Jan 5 2007, 08:43 AM
Ummm, I'm sorry I wasn't more clear.

I don't think that this log hit was produced by a manual review. Heh, I like to look at my raw logs from time to time and I know the difference between a bot hit and a human one.

My suggestion/speculation is that the SEs will not spider .css files until they are triggered by some human request to do so.

I'm also quite sure that if your site is being reviewed by humans, you won't necessarily notice that in your logs. If I were a SE engineer, I'd make a local copy and dissect that first and look at the dynamic server-side stuff later. If only not to alert the webmaster being scrutinised.......

Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,482
From: CHeeseland
post Jan 5 2007, 08:57 AM
oooh, time to check the logs. What about javascript files?

It would be just a matter of time until they actually do this. I imagine it works much like their other systems in that it tries to recognize problematic / fishy code and flags it for a manual review. With enough red flags or spam reports, your site will likely get a manual review. At least that's how I've puzzled it together :) - that's how I would do it :D.

There's just no way to do fully automatic detection of hidden text; it's impossible. Sometimes you can't even do it manually without really searching. Simple things like style sheets for different uses (screen, print, handheld, etc) will almost always result in several elements being invisible.
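
To illustrate the media point: a perfectly ordinary stylesheet hides the navigation when printing, and a naive "find display:none" scan would flag it. The sketch below groups hiding rules by the media they apply to, so print-only or handheld-only hiding can at least be told apart from hiding on every medium. It is a toy under stated assumptions: it only understands `@media` blocks with one level of nesting inside a single stylesheet, it ignores media set on the link tag, and the function name is made up.

```python
import re

# "@media <list> { rule { ... } rule { ... } }" with one level of nesting.
MEDIA = re.compile(r"@media\s+([^{]+)\{((?:[^{}]*\{[^{}]*\})*)[^{}]*\}")
RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")
HIDES = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE)


def hidden_selectors_by_media(css):
    """Map each media type to the selectors that hide content under it."""
    result = {}

    def scan(block, media):
        for selector, body in RULE.findall(block):
            if HIDES.search(body):
                result.setdefault(media, []).append(selector.strip())

    # Rules inside @media blocks are recorded under their media list.
    for media, block in MEDIA.findall(css):
        scan(block, media.strip())
    # Whatever is left after removing the @media blocks applies everywhere.
    scan(MEDIA.sub("", css), "all")
    return result
```

A rule listed only under "print" hides nothing from a visitor looking at the screen, which is one reason a flag from a scan like this can only ever be a hint for further review.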

It's strange ... if you overdo it with hidden text, you'll probably be tagged for stuffing keywords into your text anyway. And if you try to "get it right" you might as well just put them into your visible part of the page. Why hide it if it'll get caught anyway - or if you could have them visible?

I wonder who's going to be the first to cloak stylesheets to Google :D.

It could also be something simple -- like trying to check the stylesheet for compatibility with the cache-display page (but I doubt it).

Is your stylesheet blocked by the robots.txt?

John

Membership Admin & Moderator

Group: Membership Admin & Moderator
Joined: 30-September 05
Posts: 3,326
From: Some round-ish rock floating in a vacuum.
post Jan 5 2007, 09:10 AM
QUOTE(softplus @ Jan 5 2007, 01:57 PM)
Is your stylesheet blocked by the robots.txt?

Nope. That's the most common question. The directory where x.css resides is not blocked by robots.txt (or anything else, for that matter).
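
If you want to check the same thing for your own site, Python's standard `urllib.robotparser` can answer the question from the text of a robots.txt file. A small sketch; the helper name is made up and the robots.txt contents in the usage note are placeholders, not the real ekstreme.com file.

```python
from urllib.robotparser import RobotFileParser


def googlebot_may_fetch(robots_txt, path, agent="Googlebot"):
    """Return True if the given robots.txt text lets the agent fetch path."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)
```

For example, with a robots.txt of `User-agent: *` followed by `Disallow: /private/`, the call `googlebot_may_fetch(text, "/global/x.css")` reports that the stylesheet is allowed.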

I'll do JS files later tonight when I get home.

Pierre

Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,482
From: CHeeseland
post Jan 5 2007, 09:18 AM
How much do you depend on ekstreme.com's Google traffic? ;)

How about cloaking them a "really bad" CSS file? Or a stylesheet which contains classes not used which are all hidden?

John

Membership Admin & Moderator

Group: Membership Admin & Moderator
Joined: 30-September 05
Posts: 3,326
From: Some round-ish rock floating in a vacuum.
post Jan 5 2007, 09:25 AM
QUOTE(softplus @ Jan 5 2007, 02:18 PM)
How much do you depend on ekstreme.com's Google traffic? ;)

How about cloaking them a "really bad" CSS file? Or a stylesheet which contains classes not used which are all hidden?

Ummm, no thanks. Besides, it was just the one request two months ago. Would they even notice?

Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,482
From: CHeeseland
post Jan 5 2007, 09:32 AM
One request two months ago sounds more like a glitch, though. Maybe someone had a link to your CSS file? If they were checking CSS files for presumed hidden text, they would be doing it regularly. If they needed to build a database to test with, they could have done it without using a "Googlebot" IP/user agent.

I once noticed something neater (imho): An IP from Google grabs the page, an IP from some open proxy server grabs the javascript and css files. I could see that they belonged to each other. Scary!! Then I figured out it was a Google Web Accelerator user :D.

John