
Cre8asiteforums Internet Marketing and Conversion Web Design



There's No % Duplicate Content Threshold



#1 Halfdeck

Halfdeck

    Gravity Master Member

  • Members
  • 110 posts

Posted 05 January 2007 - 03:16 AM

Or so Adam Lasnik says:

Question: why not build into your webmaster toolkit something like a "Duplicate Content" threshold meter.

The fact that duplicate content isn't very cut and dry for us either (e.g., it's not "if more than [x]% of words on page A match page B...") makes this a complicated prospect.


EDIT:

A few other interesting tidbits about duplicate content penalty:

As I noted in the original post, penalties in the context of duplicate content are rare. Ignoring duplicate content or just picking a canonical version is MUCH more typical.


Again, this very, very rarely triggers a penalty. I can only recall seeing penalties when a site is perceived to be particularly "empty" + redundant; e.g., a reasonable person looking at it would cry "krikey! it's all basically the same junk on every page!"


Edited by Halfdeck, 05 January 2007 - 03:22 AM.


#2 A.N.Onym

A.N.Onym

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 4003 posts

Posted 05 January 2007 - 03:30 AM

It doesn't mean that you can duplicate all the way, though. The more unique, the better :)

Edited by A.N.Onym, 05 January 2007 - 03:32 AM.


#3 bwelford

bwelford

    Peacekeeper Administrator

  • Admin - Top Level
  • 8995 posts

Posted 05 January 2007 - 05:53 AM

Well done for pointing that out, Halfdeck. It's good to see that Google behaves, at least in this case, in a common-sense way.

#4 Wit

Wit

    Sonic Boom Member

  • 1000 Post Club
  • 1599 posts

Posted 05 January 2007 - 06:01 AM

Phew. So we haven't been lying for the last couple of years. Unless Adam is fibbing of course (but I doubt it).

#5 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 05 January 2007 - 09:13 AM

Duplicate content is an interesting topic :huh:. Google's algorithms really don't seem to be so simple as to take a fixed percentage and use it as a threshold; there are always several factors that come together.

I think there is one thing that many people mix up -- there's in-site duplicate content (navigation, disclaimers, etc) and there's cross-site duplicate content (content which comes from some other site). The in-site duplicates are easy to recognize and ignore, once enough of the site is crawled.

If you remove the in-site duplication from a page, you're left with the "unique" part of the page (I'll call it "site-unique content", because I feel like it :D). If nothing is left (too much duplication within the same site), then that page might get ignored or devalued.
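As a rough illustration of that filtering step (this is just a sketch of the idea, not anything Google has confirmed): treat each page as a list of text blocks, and drop any block that shows up on most pages of the site, since it's probably navigation, a disclaimer, or some other template. Whatever survives is the "site-unique" part; a page that ends up empty has nothing of its own. In Python, something like:

# Toy sketch of in-site duplicate filtering (illustrative only, not Google's
# actual method). Each page is a list of text blocks; blocks that appear on
# most pages of the site are treated as template/boilerplate and dropped.

from collections import Counter

def site_unique_blocks(pages, max_share=0.6):
    # Count on how many pages each block appears (set() so repeats within
    # a single page only count once).
    block_counts = Counter(block for page in pages for block in set(page))
    threshold = max_share * len(pages)
    # Keep only blocks that appear on few enough pages to look page-specific.
    return [[b for b in page if block_counts[b] <= threshold] for page in pages]

pages = [
    ["Home | About | Contact", "Welcome to our widget store.", "(c) 2007 Widgets Inc."],
    ["Home | About | Contact", "Blue widgets ship in 3 days.", "(c) 2007 Widgets Inc."],
    ["Home | About | Contact", "(c) 2007 Widgets Inc."],  # nothing site-unique left
]

for unique in site_unique_blocks(pages):
    print(unique or "<- would likely get ignored/devalued")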

If the site-unique part is cross-site duplication, then it would be another reason to ignore it. If too much of a site is cross-site duplicated (and if it is recognized that the content originally comes from elsewhere), then that could be a reason for a big red flag for a manual review or penalty.

Imagine the following situation:
"A friend" has a site where - within the site-unique content - only 1% is web-unique and the rest comes from several known sites (easy to determine with a shingle analysis). That would look like something which isn't really that useful for the general web population.

You could even go so far as to (gasp!) pass the value of that site to the sites where the content originally came from. Imagine that: you scrape ekstreme.com, and when it's noticed (automatically), ekstreme.com gets the PR and other value that you built up for your site. That would fit with the general comments Adam and Matt have made: why index an affiliate site when the original content owner could be pushed instead? It could even be done automatically. With long enough shingles (I have heard something like 14 words) it would be simple to find the closest owner of the full page content (assuming it's indexed).
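A toy version of that shingle comparison might look like the sketch below (purely illustrative; the site names are made up, and a 4-word shingle is used here just so the tiny example strings overlap - the ~14 words mentioned above is only hearsay). The idea: turn each document into its set of w-word shingles, then credit whichever indexed candidate shares the most shingles with the scraped page as the likely original.

# Toy shingle comparison (illustration of the idea only -- not Google's
# algorithm; shingle size and site names are made up for the example).

def shingles(text, w=4):
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(max(len(words) - w + 1, 0))}

def resemblance(a, b, w=4):
    # Jaccard overlap of the two shingle sets.
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0

scraped = "a long article about detecting duplicate content with shingles plus some spun filler"
candidates = {
    "original-site.example": "a long article about detecting duplicate content with shingles",
    "unrelated-site.example": "a completely different page about cooking pasta at home",
}

# Credit whichever indexed candidate shares the most shingles with the copy.
best = max(candidates, key=lambda site: resemblance(scraped, candidates[site]))
print(best, round(resemblance(scraped, candidates[best]), 2))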

How's that for a thought? The original owner doesn't have to worry about duplicate content and doesn't have to fight people scraping their site, because their original content will just gain value when other people copy it. That would be a simple and scalable approach, no need to hand out penalties, the value is passed directly. The only real problem would be when the original owner does not have his content indexed yet -- ie a scraper finds great content which is not in the index yet (would be hard to find, if it's really "great"; how would you stumble upon it?)...

Does any of that make sense?

John

#6 egain

egain

    Gravity Master Member

  • Members
  • 121 posts

Posted 05 January 2007 - 11:59 AM

Vanessa Fox covered aspects of duplicate content penalties during her recent interview with Rand on WebProNews, I seem to remember.

I put a brief overview on our blog, with a link to the full video attached (Admins, change this if you're not happy)

http://www.e-gain.co...ues/2006/12/07/

#7 A.N.Onym

A.N.Onym

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 4003 posts

Posted 06 January 2007 - 03:33 AM

I believe the interview was conducted before there was a post on the Google Webmasters Blog, though.

#8 joedolson

joedolson

    Eyes Like Hawk Moderator

  • Technical Administrators
  • 2902 posts

Posted 06 January 2007 - 10:40 AM

How's that for a thought? The original owner doesn't have to worry about duplicate content and doesn't have to fight people scraping their site, because their original content will just gain value when other people copy it. That would be a simple and scalable approach, no need to hand out penalties, the value is passed directly. The only real problem would be when the original owner does not have his content indexed yet -- ie a scraper finds great content which is not in the index yet (would be hard to find, if it's really "great"; how would you stumble upon it?)...

Does any of that make sense?


Makes perfect sense to me! The only other problem I can see is identifying cases where the search engine has detected the wrong source as the original - for example, if your content is syndicated by WebProNews and the algorithm decides that WebProNews must be the original because that site is older and more authoritative, even though the content is identical.

Still, there are frequently some types of time stamp: date last modified, etc., and if your _page_ is evidently older you should be fine...

Anyhow, I'm just wandering now - I think your suggestion makes perfect sense!

#9 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 06 January 2007 - 11:08 AM

Yes, it's hard to find the original owner. But ... if you look back a bit into Google's origins you'll spot a reference to COPS on http://infolab.stanford.edu/~sergey/ (3 cheers for keeping old content online) - "Copy Detection Mechanisms for Digital Documents" .... sounds a lot like what would be needed :D. There are some interesting names on those old documents.

While we're at it, who wants to join me for CS 349? Sounds interesting :). The references for CS 345a also sound interesting.

John


