Cloaking Beyond IP Delivery
Posted 13 March 2006 - 09:20 PM
It got me thinking.. What are the other methods for delivering unique content to spiders vs. humans? What other pieces of information do spiders (particularly those that search for spam specifically and don't identify themselves as coming from the engines) show that could tip off a savvy developer?
This is more academic than anything else, as I can't think of a circumstance where we'd actually show varying content to SEs vs. humans.
Posted 13 March 2006 - 10:16 PM
If I were to try to figure out how to do this, I think I would serve the "standard" first page to everyone, and then try to serve additional pages based on subsequent behaviors. For example, if a visitor requests a second page milliseconds after the first, doesn't request images, doesn't trigger a page tagging event, etc., then I'd serve them up the "search engine" version; otherwise they'd get the people version.
Which probably makes me a bad candidate for black hat SEO. Figured I'd give it an academic shot.
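For the curious, here is a rough Python sketch of the behavioral test described above. It only classifies a visit; it doesn't serve anything. The `Session` fields, the 500 ms gap and the "two out of three signals" rule are all assumptions made up for illustration, not anything a real engine or site is known to use.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical summary of one visitor's first two page requests."""
    gap_ms: float          # time between the first and second page request
    fetched_images: bool   # did the visitor request any images?
    fired_page_tag: bool   # did the page tagging (analytics) event fire?


def looks_automated(s: Session, min_human_gap_ms: float = 500.0) -> bool:
    """Return True when at least two of the three signals look spider-like.

    The threshold and the two-of-three rule are illustrative guesses.
    """
    signals = [
        s.gap_ms < min_human_gap_ms,  # second page requested implausibly fast
        not s.fetched_images,         # no image requests at all
        not s.fired_page_tag,         # tagging script never ran
    ]
    return sum(signals) >= 2


if __name__ == "__main__":
    print(looks_automated(Session(gap_ms=20.0, fetched_images=False, fired_page_tag=False)))
    print(looks_automated(Session(gap_ms=8000.0, fetched_images=True, fired_page_tag=True)))
```

Requiring more than one signal matters because plenty of real people browse with images or scripts turned off.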
Posted 14 March 2006 - 03:20 AM
I'm sure they can detect the "dumb" CSS hiding tricks: inline CSS, setting visibility, setting impossible positioning, setting font colors (but not with image backgrounds). However, I doubt they would even consider having that go into the automatic crawler or having it trigger automatic penalties: there are just too many legitimate reasons to do all of that! Even if it triggered an alert, it would probably be too much for them to check manually.
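To give an idea of how cheap the "dumb" check is, here is a minimal Python sketch that flags inline styles using the tricks listed above. The patterns and the `flag_inline_hiding` name are my own assumptions for illustration. As noted, a hit is only a hint and never proof, since all of these have legitimate uses.

```python
import re

# Illustrative patterns for the "dumb" hiding tricks; by no means exhaustive.
HIDING_PATTERNS = [
    re.compile(r"display\s*:\s*none", re.I),
    re.compile(r"visibility\s*:\s*hidden", re.I),
    # "Impossible" positioning: a large negative offset pushes text off-screen.
    re.compile(r"(?:left|top|text-indent)\s*:\s*-\d{3,}", re.I),
]

# Only looks at double-quoted inline style attributes, to keep the sketch short.
STYLE_ATTR = re.compile(r'style\s*=\s*"([^"]*)"', re.I)


def flag_inline_hiding(html: str) -> list[str]:
    """Return every inline style value that matches a hiding pattern."""
    flagged = []
    for style in STYLE_ATTR.findall(html):
        if any(p.search(style) for p in HIDING_PATTERNS):
            flagged.append(style)
    return flagged


if __name__ == "__main__":
    sample = (
        '<div style="position:absolute; left:-9999px">keywords</div>'
        '<p style="color:red">hello</p>'
    )
    print(flag_inline_hiding(sample))
```

Font colors matched against a background image, or rules tucked away in an external stylesheet, would slip straight past something this simple.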
It's funny how you mention that cloaking / IP delivery isn't "in" any more; I see more and more of these sites all over the web lately. ...
What really surprised me was how Google handles web-spam. Reading Matt Cutts' blog, you recognize that they do it more or less manually. They rely on people sending in spam reports and follow up from there. Can you imagine their workload? There must be millions of spam sites out there; it is impossible to manually check them all!
Another thing that really surprised me was that they are apparently not working together with the AdSense / AdWords group. Looking back at the episode with EarlGrey and Newsweek (Matt jumping in and quickly banning some of his sites, ha ha, what a show: showing that they can't handle it without user input and pressure from the press): It would be soooo simple for them to take one spam site, follow the AdSense account and just ban the sites altogether. Or even just disable the AdSense account and have it display "this site might be web-spam, contact google please" in the AdSense blocks. Why don't they do that? The only reason I can see is that they're in it for the money: Who makes the most money altogether out of web-spam? Why should they bite the hand that feeds them?....
So, in the end, they wait for a user complaint about a site, perhaps they even have a threshold of 100 complaints per site before they do a manual check; and afterwards - if the site is "clean" but spammy-looking - what then?
Perhaps that is what your contacts were talking about: making technically "clean" sites that were still 100% web-spam, made-for-AdSense/YPN, etc. Without a technical reason for the search engines to ban a site, they would have to come up with really good excuses when banning a site based on its "low-quality content".
Posted 14 March 2006 - 09:00 AM
Several mentioned that IP-delivery, which used to be the most effective form of cloaking (delivering one webpage to one group of folks - maybe search engine spiders - and another to other groups) has become passe.
I suspect that with many of them that's just putting a positive spin on the fact that it has become harder and harder to use IP delivery without accidents and failures. You see, the first and most fundamental thing that makes cloaking effective (and different to mere redirects and hidden text) is that there is no way that anything that isn't on your list of IP addresses can get served the same thing as anything that is on that list. That made it utterly undetectable without proxying through a spider machine.
However, scores of various 'decloaking hazards' have emerged and indeed grown over time.
First came the translation services provided by search engines. Even back in the late nineties more than one cloaker was defeated by Altavista's babelfish showing what the spider saw on the page from its IP, and not what any other IP would get served.
Next there's the cached copy of the page that major engines keep and make available. It is possible to set a meta tag to tell the engines not to display your cached copy (needed for copyright law) but that acts like a big "check out what I'm doing" sign on your site. Anyone who wanted to find cloaked sites would immediately start by looking at the sites with caching turned off.
So, to get around that you need to use both cloaking and hidden text on the cloaked page that is only served to spiders. That way the cached copy will seem identical to the naked eye, and only looking at the source code of the cached copy could enable anyone to beat the cloaking.
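To show how easy the "caching turned off" filter is to automate, here is a small Python sketch that checks a page for the `noarchive` robots directive using only the standard library. The class and function names are hypothetical, and a hit only says that the cache is disabled. It says nothing about why.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> / <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values may be None.
        a = {k: (v or "") for k, v in attrs}
        if a.get("name", "").lower() in ("robots", "googlebot"):
            for directive in a.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noarchive(html: str) -> bool:
    """True if the page asks engines not to show a cached copy."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" in parser.directives


if __name__ == "__main__":
    print(has_noarchive('<head><meta name="ROBOTS" content="NOARCHIVE, follow"></head>'))
    print(has_noarchive('<head><meta name="robots" content="index,follow"></head>'))
```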
Of course, there is another serious decloaking hazard. Previously the search engine would only get the cloaked version of the page. It couldn't get the uncloaked version, and so it could not automatically compare the two. However, toolbars and desktop search both offer ready means to 'sample' exactly what the user is getting and to see whether it is even one bit different to what the engine has recorded.
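The comparison itself is trivial once both versions are in hand. Here is a hedged Python sketch: an exact fingerprint for the "even one bit different" test (after collapsing whitespace), plus a word-level similarity ratio, since dynamic pages differ a little for innocent reasons. The function names and the normalization are my assumptions, not a description of what any engine actually runs.

```python
import difflib
import hashlib


def fingerprint(body: str) -> str:
    """SHA-256 of the page text with runs of whitespace collapsed."""
    normalized = " ".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def compare_versions(crawled: str, sampled: str) -> tuple[bool, float]:
    """Compare the spider's copy with a copy sampled from a real user's view.

    Returns (identical, similarity), where similarity is a 0..1 ratio
    computed over whitespace-separated words.
    """
    identical = fingerprint(crawled) == fingerprint(sampled)
    similarity = difflib.SequenceMatcher(
        None, crawled.split(), sampled.split()
    ).ratio()
    return identical, similarity


if __name__ == "__main__":
    print(compare_versions("cheap widgets  buy now", "cheap widgets buy now"))
    print(compare_versions("cheap widgets buy now", "welcome to our family photo album"))
```

A low similarity on a page that should be static would be the interesting case; a small difference on a news page would not be.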
In addition, search engines know about cloaking now, and can easily make a spider that uses random proxies just to double-check a few sites that it feels might be likely to cloak (because of whatever suspect attributes, right down to simply being in a keyword market that is particularly likely to contain spam and cloaking).
Finally, for this brief post at least, it is becoming harder and harder to build a reliable list of spider IPs in the first place. Spiders can be programmed to act like ordinary users. In fact, they can be programmed to have specific behaviors, so one spider spends a long time on any page before following a link, and goes back to check 2 or 3 links at a visit like an uncertain person reading an entire page carefully before exploring links and using the back button. Another spider makes quick decisions regarding links, but still within human ranges. It rarely seems to go back to a previous page except by links rather than a back button. In other words, spiders that are built to not act like spiders in most ways. Spider traps can still catch them, but the traps have to be better and with more failsafes than is typical.
Posted 14 March 2006 - 12:00 PM
(just kidding LOL)
Posted 14 March 2006 - 06:23 PM
I see your point, Ammon, about how difficult it's getting to fool spiders and be stable with IP delivery in the long run.
John, I think you have a great point about CSS and AJAX. I note that even on the current SEOmoz site, we're doing a kind of cloaking - the source code shows text, but for many headlines and features, we have those as images in the version a user sees (granted the content is the same, but we could easily change that).
Posted 14 March 2006 - 07:37 PM
I imagine you could also set up .htaccess to block direct requests for a CSS file so it couldn't be looked at individually as well.
Could still be got at manually, but should make it tricky for a spider to do.
Posted 15 March 2006 - 03:45 AM
There are some other things I have in mind but I'm going to test them first LOL.
There are lots of ways to play that game without having to resort to IP delivery...
Posted 15 March 2006 - 07:53 PM
Even if search engines decide to start interpreting JavaScript, you could start cloaking AJAX calls. Deliver different data to the AJAX call depending on user-agent/IP/ability to load images, CSS, etc.
Now throw in some obfuscated JavaScript code and I really don't see search engines detecting it.