Cre8asiteforums Internet Marketing
and Conversion Web Design


Cloaking Beyond IP Delivery



#1 randfish

    Hall of Fame

  • Members
  • 937 posts

Posted 13 March 2006 - 09:20 PM

In my recent travels, I spoke with some big-time black hat folks. Several mentioned that IP-delivery, which used to be the most effective form of cloaking (delivering one webpage to one group of folks - maybe search engine spiders - and another to other groups) has become passé.

It got me thinking... What are the other methods for delivering unique content to spiders vs. humans? What other pieces of information do spiders (particularly those that search for spam specifically and don't identify themselves as coming from the engines) give away that could tip off a savvy developer?

This is more academic than anything else, as I can't think of a circumstance where we'd actually show varying content to SEs vs. humans.

#2 dgeary9

    Mach 1 Member

  • Members
  • 334 posts

Posted 13 March 2006 - 10:16 PM

Hmm, interesting question. I know very little about the technology, or the history of what was passé five years ago, LOL.

If I were to try to figure out how to do this, I think I would serve the "standard" first page to everyone, and then try to serve additional pages based on subsequent behaviors. For example, if a visitor requests a second page milliseconds after the first, doesn't request images, doesn't trigger a page-tagging event, etc., then I'd serve them the "search engine" version; otherwise they'd get the people version.
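As a rough sketch of that behaviour-based idea (Express-style JavaScript; the thresholds, paths and page strings are all invented for illustration, not anything actually deployed):

    // Classify visitors by behaviour rather than IP: pages requested only
    // milliseconds apart, with no image requests and no page-tagging beacon,
    // look more like a crawler than a person.
    const express = require('express');
    const app = express();

    const visitors = new Map(); // ip -> { lastHit, loadedImages, firedTag, suspectedBot }

    app.use((req, res, next) => {
      const v = visitors.get(req.ip) || { lastHit: 0, loadedImages: false, firedTag: false };
      const now = Date.now();
      if (/\.(gif|jpe?g|png)$/.test(req.path)) v.loadedImages = true; // browsers fetch images
      if (req.path === '/tag.js') v.firedTag = true;                  // page-tagging event fired
      v.suspectedBot = (now - v.lastHit < 500) && !v.loadedImages && !v.firedTag;
      v.lastHit = now;
      visitors.set(req.ip, v);
      next();
    });

    app.get('/page/:id', (req, res) => {
      const v = visitors.get(req.ip);
      res.send(v && v.suspectedBot
        ? '<html><body>search engine version</body></html>'
        : '<html><body>people version</body></html>');
    });

    app.listen(8080);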

Which probably makes me a bad candidate for black hat SEO :D . Figured I'd give it an academic shot :D .

#3 JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 14 March 2006 - 03:20 AM

AJAX and all its cousins come to mind :D - I haven't caught a spider grabbing my AJAX pages so far; they only index the static content of those pages.
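For example, something as small as this (URL and element id invented) keeps the real text out of the static HTML, so a crawler that doesn't execute JavaScript never sees it:

    // The page ships with only a placeholder <div id="extra"></div>;
    // the actual content is fetched after load and injected client-side.
    window.onload = function () {
      var xhr = new XMLHttpRequest();
      xhr.open('GET', '/fragments/extra-content.html', true);
      xhr.onreadystatechange = function () {
        if (xhr.readyState === 4 && xhr.status === 200) {
          document.getElementById('extra').innerHTML = xhr.responseText;
        }
      };
      xhr.send();
    };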

CSS is another great possibility. There are so many ways you can use CSS combined with JavaScript that it is absolutely impossible for a search engine to check the validity of an element that is hidden in a stylesheet for a specific medium. I still haven't seen an engine crawl my external CSS stylesheets (but I have seen some manual accesses from Google to them, lol).
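A minimal version of that per-medium hiding might look like this (class name invented):

    /* The keyword block stays in the HTML for anything that parses the markup,
       but the screen stylesheet hides it from actual visitors; a print or aural
       stylesheet could leave the very same element visible, so no single rule
       flatly says "this text is always hidden". */
    @media screen {
      .extra-keywords { display: none; }
    }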

Google has said that they will use a browser-based spider to render pages and extract information from that (OCR, etc.). I say they're just spreading FUD. Could you imagine the resources required by something like that? And not to mention the security issues of running JavaScript inside a search engine's crawler... and then all the information that is behind action items, on timers, etc. - it just isn't possible to check / crawl sites like that on an automated basis.

I'm sure they can detect the "dumb" CSS hiding tricks: inline CSS, setting visibility, setting impossible positioning, setting font colors (though not against image backgrounds). However, I doubt they would even consider having that go into the automatic crawler or having it trigger automatic penalties: there are just too many legitimate reasons to do all of that! Even if it triggered an alert, it would probably be too much for them to manually check.
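For reference, the "dumb" tricks listed above look roughly like this (selectors invented) - these are the patterns an engine could plausibly match on:

    .hide-a { visibility: hidden; }                   /* setting visibility */
    .hide-b { position: absolute; left: -9999px; }    /* impossible positioning */
    .hide-c { color: #fff; background: url(bg.png); } /* font colour over an image
                                                         background - hard to verify */
    /* ...plus the inline equivalent right in the markup: <div style="display:none"> */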

It's funny how you mention that cloaking / ip-delivery isn't "in" any more, I see more and more of these sites all over the web lately. ...

What really surprised me was how Google handles web-spam. Reading Matt Cutts' blog, you realize that they do it more or less manually. They rely on people sending in spam reports and follow up from there. Can you imagine their workload? There must be millions of spam sites out there; it is impossible to manually check them all!

Another thing that really surprised me was that they are apparently not working together with the Adsense / Adwords group. Looking back at the episode with EarlGrey and Newsweek (Matt jumping in and quickly banning some of his sites, ha ha, what a show: showing that they can't handle it without user input and pressure from the press): It would be soooo simple for them to take one spam site, follow the adsense-account and just ban the sites altogether. Or even just disable the adsense-account and have it display "this site might be web-spam, contact google please" in the adsense blocks. Why don't they do that? The only reason I can see is that they're in it for the money: Who makes the most money altogether out of web-spam? Why should they bite the hand that feeds them?....

So, in the end, they wait for user complaints about a site - perhaps they even have a threshold of 100 complaints per site before they do a manual check - and afterwards, if the site is "clean" but spammy-looking, what then?

Perhaps that is what your contacts were talking about: making technically "clean" sites that are still 100% web-spam, made-for-AdSense/YPN, etc. Without a technical reason for the search engines to ban a site, they would have to come up with really good excuses when banning a site based on its "low-quality content".

Cheers
John

#4 Black_Knight

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9339 posts

Posted 14 March 2006 - 09:00 AM

Quote (randfish):
Several mentioned that IP-delivery, which used to be the most effective form of cloaking (delivering one webpage to one group of folks - maybe search engine spiders - and another to other groups) has become passé.

I suspect that with many of them that's just putting a positive spin on the fact that it has become harder and harder to use IP delivery without accidents and failures. You see, the first and most fundamental thing that makes cloaking effective (and different to mere redirects and hidden text) is that nothing outside your list of IP addresses can ever be served what the addresses on that list are served. That made it utterly undetectable without proxying through a spider machine.

However, scores of various 'decloaking hazards' have emerged and indeed grown over time.

First came the translation services provided by search engines. Even back in the late nineties, more than one cloaker was defeated by AltaVista's Babel Fish showing what the spider saw on the page from its IP, and not what any other IP would get served.

Next there's the cached copy of the page that major engines keep and make available. It is possible to set a meta tag telling the engines not to display your cached copy (provided for copyright reasons), but that acts like a big "check out what I'm doing" sign on your site. Anyone wanting to find cloaked sites would immediately start by looking at the sites with caching turned off.
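The meta tag in question is the robots "noarchive" directive, which looks like this:

    <!-- asks the engines not to show a cached copy of this page -->
    <meta name="robots" content="noarchive">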

So, to get around that you need to use both cloaking and hidden text on the cloaked page that is only served to spiders. That way the cached copy will seem identical to the naked eye, and only looking at the source code of the cached copy could enable anyone to beat the cloaking.

Of course, another serious decloaking hazard is that previously the search engine would only get the cloaked version of the page. It couldn't get the uncloaked version, and so it could not automatically compare the two. However, toolbars and desktop search both offer ready means to 'sample' exactly what the user is getting if it is even one bit different to what the engine has recorded.

In addition, search engines know about cloaking now, and can easily make a spider that uses random proxies just to double-check a few sites that it feels might be likely to cloak (because of whatever suspect attributes, right down to simply being in a keyword market that is particularly likely to contain spam and cloaking).

Finally, for this brief post at least, it is becoming harder and harder to build a reliable list of spider IPs in the first place. Spiders can be programmed to act like ordinary users. In fact, they can be programmed to have specific behaviors, so one spider spends a long time on any page before following a link, and goes back to check 2 or 3 links at a visit like an uncertain person reading an entire page carefully before exploring links and using the back button. Another spider makes quick decisions regarding links, but still within human ranges. It rarely seems to go back to a previous page except by links rather than a back button. In other words, spiders that are built to not act like spiders in most ways. Spider traps can still catch them, but the traps have to be better and with more failsafes than is typical.
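One classic form of such a trap (paths and names below are invented): a URL that is disallowed in robots.txt and reachable only via a link no human ever sees, so whatever requests it anyway is almost certainly an automated crawler ignoring the rules.

    const express = require('express');
    const app = express();
    const suspectedSpiders = new Set();

    // /trap/ is disallowed in robots.txt and linked only from a link
    // hidden from human visitors, so any request here flags the client.
    app.get('/trap/:token', (req, res) => {
      suspectedSpiders.add(req.ip);
      res.status(404).send('Not found'); // give nothing away
    });

    app.listen(8080);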

#5 Wit

    Sonic Boom Member

  • 1000 Post Club
  • 1599 posts

Posted 14 March 2006 - 12:00 PM

...but then again: why would we tell YOU?

(just kidding LOL)

#6 randfish

    Hall of Fame

  • Members
  • 937 posts

Posted 14 March 2006 - 06:23 PM

Great replies (except Wit :))

I see your point, Ammon, about how difficult it's getting to fool spiders and be stable with IP delivery in the long run.

John, I think you have a great point about CSS and AJAX. I note that even on the current SEOmoz site, we're doing a kind of cloaking - the source code shows text, but for many headlines and features, we have those as images in the version a user sees (granted the content is the same, but we could easily change that).

I'd love to see examples of how far this type of cloaking could be taken - making CSS files hard for spiders to access and hard for users to interpret, or adding lots of JavaScript to change the page based on mouse position (what do spiders hover on?), etc.
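A toy version of the mouse-position idea (element id and copy invented): the visitor-facing text only appears once a mouse actually moves, which a crawler never produces.

    var revealed = false;
    document.addEventListener('mousemove', function () {
      if (revealed) return;
      revealed = true;
      // swap in the copy meant for people; crawlers only keep the original markup
      document.getElementById('headline').innerHTML = 'Copy meant for human visitors';
    });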

#7 Adrian

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 5779 posts

Posted 14 March 2006 - 07:37 PM

Sticking a CSS file in its own directory that spiders are banned from via robots.txt would be one easy way.

I imagine you could also set up .htaccess to block direct requests for a CSS file so it couldn't be looked at individually as well.
It could still be got at manually, but that should make it tricky for a spider to do.
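Roughly, the two pieces could look like this (paths and domain are placeholders, and the Referer check is easily fooled by anyone doing it by hand, which fits the caveat above):

    # robots.txt - keep well-behaved spiders out of the stylesheet directory
    User-agent: *
    Disallow: /css/

    # .htaccess inside /css/ - refuse "direct" requests for the stylesheets,
    # i.e. requests that don't arrive with the site itself as the Referer
    RewriteEngine On
    RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com/ [NC]
    RewriteRule \.css$ - [F]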

#8 JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 15 March 2006 - 03:45 AM

It would be interesting to get more information on the way the search engines (only Google?) interpret CSS. My guess is that they don't really interpret it, but just look for patterns and tag those domains for a manual check (in a probably very long queue...). You could play the old CSS-hacks game and craft a CSS file that looks clean (when doing a simple match for "display: none;") but that still hides content (e.g. by putting a line break in between, etc.).
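For instance, a literal search for "display: none;" won't match this, though browsers parse it identically (selector invented):

    .promo-block {
      display:
          none;
    }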

There are some other things I have in mind but I'm going to test them first LOL.

There are lots of ways to play that game without having to resort to IP delivery...

John

#9 dyn4mik3

    Ready To Fly Member

  • Members
  • 20 posts

Posted 15 March 2006 - 07:53 PM

Layer everything:

IP delivery leads to a page with CSS/JavaScript that is blocked by robots.txt. JavaScript then modifies the client-side HTML.

Even if search engines decide to start interpreting JavaScript, you could start cloaking the AJAX calls: deliver different data to the AJAX call depending on user-agent/IP/ability to load images, CSS, etc.

Now throw in some obfuscated JavaScript code and I really don't see search engines detecting it.
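A bare sketch of that cloaked-AJAX layer (the endpoint and the crude user-agent test are invented; an engine could of course spoof the user-agent, which is the usual catch):

    const express = require('express');
    const app = express();

    const BOT_UA = /googlebot|slurp|msnbot/i; // naive check, easily extended

    // The same endpoint answers a recognised crawler with one payload
    // and everyone else with another.
    app.get('/ajax/content', (req, res) => {
      const isBot = BOT_UA.test(req.headers['user-agent'] || '');
      res.json(isBot
        ? { html: '<p>Keyword-rich copy served to spiders</p>' }
        : { html: '<p>The copy human visitors actually see</p>' });
    });

    app.listen(8080);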


