
Cre8asiteforums - Internet Marketing and Conversion Web Design



3 Week Old Site gets 4 Billion Pages Indexed by Google


17 replies to this topic

#1 phaithful


    Light Speed Member

  • Members
  • 800 posts

Posted 18 June 2006 - 01:55 AM

You've probably all read it by now, and I find this highly fascinating.

Check out this article: http://merged.ca/mon...-by-Google.html

It basically outlines a discussion going on over at DP about some spammers using black hat techniques to get a relatively new site out of the sandbox and ranking well for long-tail search terms.

The basic concept is to create subdomains and sub-subdomains, interlink all of them, and create your very own network of billions of backlinks.

It exploits Google's one-page index rule. Basically, Google sees each subdomain as a separate site, and each site is entitled to have at least one page indexed. Multiply this by, oh.... say 3-5 billion times... and although you don't have the "quality" backlinks... you've got the "quantity".
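
A rough sketch of the arithmetic behind that quantity-over-quality claim (the subdomain counts below are invented for illustration, not taken from the actual network):

# Rough arithmetic for a fully interlinked subdomain network.
# The counts below are invented for illustration only.

def spam_network_size(subdomains, sub_subdomains_each):
    # Every host gets its one index page into the engine, and every
    # page links to every other page, so links grow roughly as hosts squared.
    hosts = subdomains + subdomains * sub_subdomains_each
    links = hosts * (hosts - 1)   # full interlinking, no self-links
    return hosts, links

hosts, links = spam_network_size(subdomains=10_000, sub_subdomains_each=200)
print(f"{hosts:,} hosts -> {links:,} backlinks of zero individual quality")
# 2,010,000 hosts -> roughly 4 trillion links: quantity, not quality.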

Pretty interesting....

#2 yannis


    Sonic Boom Member

  • 1000 Post Club
  • 1634 posts

Posted 18 June 2006 - 03:32 AM

Phaithful

We had a discussion here just yesterday about the subject you are talking about! Check this thread.

There were quite a few websites involved. I have kept a collection of quite a few sites over the last few weeks, as this subject is of interest to me - not to exploit black hat techniques, but because one can actually learn something!

I think they have been hitting quite a few of the other click companies hard, besides Google!

With tail-end results at maybe 1/100,000 clicks per view, possibly at very high click prices, I will take a guess and say that they are making about $6,000 per day. Not too bad with a server or two and some scripts. Quite a few of the domain registrars are doing the same - do a search here for domain kiting! What a schlepp!
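
Spelling that guess out as back-of-envelope arithmetic in a quick script; the click rate is the one guessed above, while the views-per-page and earnings-per-click figures are pure assumptions:

# Back-of-envelope version of the guess above. The click rate comes from the
# post; the views-per-page and earnings-per-click figures are assumed.

indexed_pages      = 4_000_000_000  # the figure in the thread title
views_per_page     = 1              # assumed: one long-tail view per page per day
clicks_per_view    = 1 / 100_000    # the tail-end click rate guessed above
earnings_per_click = 0.15           # assumed: $0.15 average per click

daily_revenue = indexed_pages * views_per_page * clicks_per_view * earnings_per_click
print(f"~${daily_revenue:,.0f} per day under these assumptions")  # ~$6,000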

Yannis

Edited by yannis, 18 June 2006 - 03:59 AM.


#3 bwelford


    Peacekeeper Administrator

  • Site Administrators
  • 9005 posts

Posted 18 June 2006 - 05:01 AM

For once I think I understand what's going on here. :)

So I'm sure a few PhDs can figure out how to block this fairly quickly. Still, it's very unsightly to see these huge mounds of virtual junk scattered around the landscape. :)

#4 JohnMu


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 18 June 2006 - 06:10 AM

Hi Barry

Sure it's easy to recognize - but from what I've seen of Google it might take some time, because they'll want to integrate it into their algorithm instead of just adding another patch on top of it all. Getting something new into that algorithm is probably a major pain; just think of all the tests they'd have to run to make sure that the one change (and blocking these kinds of sites probably won't be a minor adjustment either) doesn't upset the rest of the algorithm.

Of course they might just block / ban these specific sites first, but there will be lots more now that someone leaked the idea. (I wonder how that happened -- even a site of that size doesn't get noticed by the general public and then by the informed SEOs within 3 weeks.... It would be interesting to trace the origins of the information :))

Does anyone else feel this is a Matt-Cutts-Vacation-Filler? Three weeks sounds like just too much of a coincidence :)

John

#5 yannis


    Sonic Boom Member

  • 1000 Post Club
  • 1634 posts

Posted 18 June 2006 - 07:32 AM

For once I think I understand what's going on here.



It's actually very easy to understand how the spam got past the filters. All domains or subdomains automatically get their one page indexed by Google, and there is no sandbox for index pages. I think the sandbox is actually just a factor applied to links, to let a new page climb up slowly.

As for PhDs sorting it out, it will need more than that. The quick way is to just ban the IP. You cannot even remove the pages from the index easily; there are all sorts of domains registered. Most of these pages just carried caches scraped from search engines, so how do you identify the content as duplicate?
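
One standard family of answers is shingle-based near-duplicate detection. A toy sketch of the idea (an illustration only, not how Google's duplicate filters actually work):

# Toy near-duplicate check using word shingles.

def shingles(text, k=5):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

page      = "napoleon was defeated at the battle of waterloo in 1815 by wellington"
scraped   = "napoleon was defeated at the battle of waterloo in 1815"
unrelated = "baked brie with garlic makes a fine starter for a dinner party"

print(jaccard(shingles(page), shingles(scraped)))    # high (0.75) -> likely a scrape
print(jaccard(shingles(page), shingles(unrelated)))  # 0.0 -> unrelated content

Real systems hash the shingles and work at web scale, but the gap between "mostly copied" and "unrelated" is the point.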

This is actually one of the many current weaknesses of search engines: they mostly look at 'pages'. They must start looking at websites for better results (although some attempts at sorting the problem out have been made).

For example, say you have a website for the Battle of Waterloo: 600 pages, all dedicated to unique content on this topic. The SEs currently see 600 different pages, expecting each and every page to attract ranking links etc.
That is almost impossible; how does an author attract links to every page? You type a search for 'Battle of Waterloo' and you get eBay posts, or pages with maybe a line or two on the subject from 'trusted websites'. Do not give me 'trusted' websites, I want relevant results. It is like going to the library, asking for a book on the 'Battle of Waterloo', and having the librarian hand you a book with a half-page poem about the topic written by the trusted poet of the day!

It's a long way to go! Enjoy the ride!

Yannis

Edited by yannis, 18 June 2006 - 07:35 AM.


#6 Black_Knight


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9339 posts

Posted 18 June 2006 - 11:02 AM

While Google is working hard to make quality links count and to discount pure quantity, it has been known for some time that enough brute-force linking still works.

This is old-style redirecting doorway pages on the grand scale - an industrial grand scale in fact.

Is Google broken? No. Go back just 6-7 years and these pages would not have surprised you - spammy redirecting doorway pages like these were everywhere. You expected to see some of these in virtually any SERP before Google. The fact that so many people are shocked and amazed at these is itself testament to how far the search engine algorithms have come.

You used to be able to do this exact thing with just a few hundred pages in 1999, not the millions of pages and complex sub-sub-domain setups it now takes.

However, this does all show that there is still a problem with Google's canonical understanding. Is it the sub-domain of sub-domains that is baffling Google's supposed new understanding of sub-domain to domain relationships, which they were so proud to add last year?

#7 yannis


    Sonic Boom Member

  • 1000 Post Club
  • 1634 posts

Posted 18 June 2006 - 11:32 AM

This is old-style redirecting doorway pages on the grand scale - an industrial grand scale in fact.


I am not sure if the problem is redirection or the fact that Google is committed to showing at least the main page of every domain, sub-domain or sub-sub-domain. This, plus the sheer volume of pages, makes it easy to harvest tail-end results. Also, lots of users are changing their search behaviour. I for one sometimes use search phrases of up to seven words to narrow down results, or click through to page 20 of the SERPs, as good information is sometimes hidden on websites that are not well SEO'd. I agree, however, with the statement that the search engines have come a long way over the last seven years. A simple strategy like the one I suggested (maybe I should develop it in my garage :)!), where they factor in a rating based on the total pages of a website, could reduce this problem; a rough sketch follows. Any thoughts on this one?
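
A minimal sketch of that garage idea, with an invented damping formula and invented thresholds:

# Damp a page's score by a site-level factor that compares indexed page
# count to independent link support. Formula and thresholds are invented.

import math

def site_factor(indexed_pages, external_linking_domains):
    support = external_linking_domains + 1
    ratio = indexed_pages / support
    # A normal site (a few hundred pages per linking domain or less) is barely
    # touched; a site with millions of pages per linking domain is damped hard.
    return 1.0 / (1.0 + math.log10(max(ratio / 100.0, 1.0)))

print(site_factor(600, 40))             # a 600-page site with 40 linking domains -> ~1.0
print(site_factor(4_000_000_000, 10))   # the spam network                        -> ~0.13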

#8 bwelford


    Peacekeeper Administrator

  • Site Administrators
  • 9005 posts

Posted 18 June 2006 - 11:47 AM

Well I for one very much appreciate news media that archive all their web pages almost in perpetuity. Those really big websites should be in the database. But how do you then draw the line between 'legitimate' big websites and these new mammoth junky websites?

#9 yannis


    Sonic Boom Member

  • 1000 Post Club
  • 1634 posts

Posted 18 June 2006 - 12:03 PM

Well I for one very much appreciate news media that archive all their web pages almost in perpetuity. Those really big websites should be in the database.


Barry, I also appreciate those sites and even have paid subscriptions! Sometimes I also wish they would go back to listing everything for the last 100 years!

But how do you then draw the line between 'legitimate' big websites and these new mammoth junky websites?


This is actually a problem. Having watched what's happening out there for a while, the spammers (or maybe we should call them AdSense bounty hunters?) essentially use the same approach. There are hundreds, or maybe hundreds of thousands, of Wikipedia clones with one difference: the pages are listed as subdomains! This way they benefit from Google's one-page policy! Others that just put millions of pages on one domain are normally either caught by the spam filter or the duplicate content filters, or sandboxed for eternity.

It is really a very difficult problem to solve. I also do not believe that by using subdomains in the above manner they are breaking Google's TOS, so maybe it is a good strategy to just let them be: the user gets there and clicks on ads!

#10 Black_Knight


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9339 posts

Posted 18 June 2006 - 12:11 PM

Problems.

Firstly, the spammers use sub-domains because that's what works best right now. Previously they used separate domains. At $5 per year, getting extra domains is no obstacle at all.

So, recognizing only domains would fail to prevent the spam.

Secondly, the www prefix is itself a sub-domain. Millions of legitimate, top-quality sites would be negatively impacted, so the ratio of good content would decline while we make no dent in the spam, as noted above.

So, this would actually almost double the problem.

Thirdly, legitimate sites use sub-domains too, even aside from the www subdomain. How about mail.google.com for starters? In fact, sometimes the best content is available on subdomains of a major domain.
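
A minimal sketch of why grouping hosts by their registrable domain is messier than it sounds; the suffix list is a tiny stand-in for the real public-suffix data, and member.freehost.co.uk is a hypothetical free-host client:

# Grouping hosts by registrable domain needs a public-suffix list, and even
# then legitimate services (www., mail.google.com, free member hosts) live on
# subdomains. The suffix set here is a tiny stand-in for the real list.

PUBLIC_SUFFIXES = {"com", "org", "net", "co.uk"}

def registrable_domain(host):
    labels = host.lower().split(".")
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return host

for h in ["www.example.com", "mail.google.com",
          "member.freehost.co.uk", "1594.c.geku8h.org"]:
    print(h, "->", registrable_domain(h))
# Collapsing everything to its registrable domain lumps the spam host in with
# any legitimate subdomains on the same domain, and lumps every free-host
# member site into one "site" as well.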

#11 yannis


    Sonic Boom Member

  • 1000 Post Club
  • 1634 posts

Posted 18 June 2006 - 12:35 PM

Difficult but not impossible!

Some thoughts from my garage!

{ Call routine when sorting page results NOT when ranking pages }
{ Beta Version 1.0 Cre8siteSpamtoHellPro by Yannis }

Begin
  if Page NOT subdomain proceed
  else
    if content=onepage do some metrics;
    if metrics>0 then score else pagedisappearsfromSERPS(Subdomainname);
end;

Man! Pascal was so beautiful!

Yannis

PS Apologies for the non-girlie style of the above, but the editor is eating the indents here! :)

Edited by yannis, 18 June 2006 - 12:37 PM.


#12 bwelford


    Peacekeeper Administrator

  • Site Administrators
  • 9005 posts

Posted 18 June 2006 - 01:43 PM

You know, it really is very impressive nevertheless. I was just searching for a recipe for baked brie with garlic. Lo and behold, at #7 is one of their web pages. The URL was 1594.c.geku8h.org/

#13 Black_Knight


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9339 posts

Posted 18 June 2006 - 03:38 PM

{ Call routine when sorting page results NOT when ranking pages }
{ Beta Version 1.0 Cre8siteSpamtoHellPro by Yannis }

Begin
  if Page NOT subdomain proceed
  else
    if content=onepage do some metrics;
    if metrics>0 then score else pagedisappearsfromSERPS(Subdomainname);
end;

Ah, but that's cheating.

May as well have:

if Page exists magically detect spam

No magic allowed.

What metrics are you doing in the line:

if content=onepage do some metrics;

Remember, no magic allowed. Metrics either create false positives or they let things like this through, and there really isn't much alternative. Even the use of extensive failsafes in any system does not prevent failure; it simply ensures that things fail as gracefully and non-lethally as possible. The reason this spam and other spam exists is precisely those failsafes in Google.

#14 JohnMu


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 18 June 2006 - 03:53 PM

BK, based on your experience, how fast do you think they will update their algorithms / tests to weed out these types of sites? I assume the first big dump will be based on a manual removal, but after that? Changes like that are probably best not put online without very extensive tests ... will the next crop of similar sites jump into the void between the manual removal and the full algorithm change (I've already seen scripts for this floating around...)?

Actually, I'm kind of surprised that things like this do not happen on a more regular basis (or perhaps they do and I just don't notice) - Google is a top target, and if you manage to find a loophole, it would make sense to exploit it "industrially". Either they're not exploiting them (yet, or not on a large scale) or Google has gotten very sturdy.

John

#15 Black_Knight


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9339 posts

Posted 18 June 2006 - 04:57 PM

John, my experience tells me that key members of Google are working on it right now. How long it will take depends on how thorough they want to be, and how many failsafes are built in.

My bet is that this particular exploit leaves a huge footprint that will be easy to recognise, so the real time will be in determining whether many good sites would accidentally be hit by something that recognised this footprint. Make the footprint recognition too exact and a minor variation on the same spam would avoid detection. Make the footprint recognition too loose and an unacceptable level of false-positives will be found. It's finding the balance that takes the time.
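
The tight-versus-loose tradeoff in miniature; the features, thresholds, host counts and the two non-spam hostnames are all invented for illustration:

# A toy footprint score. Features and thresholds are invented;
# fresh-spam.org and freehost.co.uk are hypothetical hosts.

def footprint_score(host, hosts_on_domain):
    labels = host.split(".")
    score = sum(l.isdigit() or len(l) <= 2 for l in labels[:-2])  # junk-looking labels
    score += 1 if len(labels) >= 4 else 0                         # deep nesting
    score += 2 if hosts_on_domain > 100_000 else 0                # huge host count
    return score

cases = [
    ("1594.c.geku8h.org", 3_000_000),           # the pattern seen in this case
    ("brie.recipes.fresh-spam.org", 2_000_000), # a minor variation on the spam
    ("member.freehost.co.uk", 250_000),         # a legitimate free member host
]
for host, n in cases:
    s = footprint_score(host, n)
    print(host, "score", s, "tight(>=4):", s >= 4, "loose(>=3):", s >= 3)
# Tight: catches only the original pattern and misses the variation.
# Loose: catches the variation but also flags the legitimate free host.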

With a case this large and public, I'd expect them to apply and deploy a reasonably tight footprint recognition within a week or two at most. That saves them face right now and gives breathing space to develop a looser recognition system over the following weeks. Of course, other projects are also ongoing, and others may arise, so this will all depend on how urgently they rate this task against the other tasks in the queue.

Google's algorithms are complex enough that a great many (I'd say the vast majority) of attempts to industrially exploit loopholes are failures before they even start. However, many of the major developments of the algorithms have been motivated by exploits that have worked, such as the millions of fake web directories that sprang up a couple of years ago and dominated results for many months. A lot of exploits work for a while. The trick is to not let those exploits attract attention and get pointed out to Google's quality control team.

The real black hat practitioners regard work like this as amateurish and foolish: it closed another loophole rather than quietly continuing to exploit it.

#16 bwelford


    Peacekeeper Administrator

  • Site Administrators
  • 9005 posts

Posted 18 June 2006 - 07:37 PM

Perhaps they've been rumbled. Doing the site search suggested by monetized, you now get
Your search - site:eiqz2q.org - did not match any documents.

#17 Ruud


    Hall of Fame

  • Hall Of Fame
  • 4887 posts

Posted 18 June 2006 - 11:14 PM

the real time will be in determining whether many good sites would accidentally be hit by something that recognised this footprint.


Not an easy task, with many free member hosts signing up clients under membername.domain.com names.

Google's algorithms are complex enough that a great many (I'd say the vast majority) of attempts to industrially exploit loopholes are failures before they even start.


Which makes me wonder if Google does what others usually don't: employ the hackers. I wonder if they have a team of people working the other way around: seeing if they can get page X ranked for term Y through loophole Z.

#18 bragadocchio


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 19 June 2006 - 12:19 AM

I've been seeing sites like these in Google for a fairly long time. There's nothing quite so disappointing as receiving a Google Alert on a topic, only to see that it's a page like this, with a URL like this.

I'm not sure that I've seen one before that has billions of backlinks.

Which makes me wonder if Google does what others usually don't: employ the hackers. I wonder if they have a team of people working the other way around: seeing if they can get page X ranked for term Y through loophole Z.


I think that's what a number of search engineers do. <_<


