> There's No % Duplicate Content Threshold

Centenarian Poster

Group: Members
Joined: 5-March 06
Posts: 110
post Jan 5 2007, 03:16 AM
Or so Adam Lasnik says:

Question: why not build into your webmaster toolkit something like a "Duplicate Content" threshold meter.

QUOTE
The fact that duplicate content isn't very cut and dry for us either (e.g., it's not "if more than [x]% of words on page A match page B...") makes this a complicated prospect.


EDIT:

A few other interesting tidbits about the duplicate content penalty:

QUOTE
As I noted in the original post, penalties in the context of duplicate content are rare. Ignoring duplicate content or just picking a canonical version is MUCH more typical.


QUOTE
Again, this very, very rarely triggers a penalty. I can only recall seeing penalties when a site is perceived to be particularly "empty" + redundant; e.g., a reasonable person looking at it would cry "krikey! it's all basically the same junk on every page!"


This post has been edited by Halfdeck: Jan 5 2007, 03:22 AM

Star Member

Group: 1000 Post Club
Joined: 29-December 05
Posts: 3,291
From: Novosibirsk, Russia
post Jan 5 2007, 03:30 AM
That doesn't mean you can duplicate everything, though. The more unique, the better :)

This post has been edited by A.N.Onym: Jan 5 2007, 03:32 AM

Moderator

Group: Moderators
Joined: 6-March 03
Posts: 7,962
From: Langley, British Columbia, Canada
post Jan 5 2007, 05:53 AM
Well done for pointing that out, Halfdeck. It's good to see that Google behaves, at least in this case, in a common-sense way.

Star Member

Group: 1000 Post Club
Joined: 18-November 05
Posts: 1,392
From: GMT+1
post Jan 5 2007, 06:01 AM
Phew. So we haven't been lying for the last couple of years. Unless Adam is fibbing of course (but I doubt it).

Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
post Jan 5 2007, 09:13 AM
Duplicate content is an interesting topic :). Google's algorithms really don't seem to be so simple as to take a fixed percentage and use that as a threshold; there are always several different factors that come together.

I think there is one thing that many people mix up -- there's in-site duplicate content (navigation, disclaimers, etc) and there's cross-site duplicate content (content which comes from some other site). The in-site duplicates are easy to recognize and ignore, once enough of the site is crawled.

If you remove the in-site duplication from a page, you're left with the "unique" part of the page (I'll call it "site-unique content", because I feel like it :D). If nothing is left (too much duplication within the same site), then that page might get ignored or devalued.
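To make the "site-unique content" idea concrete, here is a minimal sketch. It assumes each page has already been split into text blocks (paragraphs, navigation, footer) and treats any block that shows up on more than half of the crawled pages as in-site duplication. The function name `site_unique_blocks`, the `max_share` cutoff, and the sample pages are all made up for illustration; nothing here is confirmed to be how Google does it.

```python
from collections import Counter

def site_unique_blocks(pages, max_share=0.5):
    """Strip blocks that repeat across a site.

    pages: dict mapping URL -> list of text blocks on that page.
    max_share: hypothetical cutoff; a block seen on a larger share of
               pages than this is treated as in-site duplication.
    Returns a dict mapping URL -> the blocks that remain.
    """
    page_count = len(pages)
    # Count on how many pages each block occurs (at most once per page).
    seen_on = Counter()
    for blocks in pages.values():
        seen_on.update(set(blocks))
    return {
        url: [b for b in blocks if seen_on[b] / page_count <= max_share]
        for url, blocks in pages.items()
    }

# Hypothetical three-page site with shared navigation and footer.
site = {
    "/a": ["Home | About | Contact", "Article about widgets.", "Copyright notice"],
    "/b": ["Home | About | Contact", "Article about gadgets.", "Copyright notice"],
    "/c": ["Home | About | Contact", "Copyright notice"],
}
unique = site_unique_blocks(site)
for url, blocks in unique.items():
    print(url, blocks)
```

Page "/c" ends up with nothing left, which is the case where a page might get ignored or devalued.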

If the site-unique part is cross-site duplication, then it would be another reason to ignore it. If too much of a site is cross-site duplicated (and if it is recognized that the content originally comes from elsewhere), then that could be a reason for a big red flag for a manual review or penalty.

Imagine the following situation:
"A friend" has a site that has, within the site-unique content, only 1% web-unique content, and the rest comes from several known sites (easy to determine with a shingle analysis). That would look like something which is not really that useful for the general web population.

You could even go so far as to (gasp!) pass the value of that site to the sites where the content originally came from. Imagine that, you scrape ekstreme.com and when it's noticed (automatically), ekstreme.com gets the PR and other value that you built for your site. That would fit into the general comments which Adam and Matt have made: why index an affiliate site when the original content owner could be pushed instead? It could even be done automatically. With long enough shingles (I have heard something like 14 words) it would be simple to find the closest owner of the full page content (assuming it's indexed).
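A rough sketch of the shingle analysis described above, under stated assumptions: a shingle is taken to be a run of consecutive lowercase words, the default length of 14 is only the figure I have heard, and the "closest owner" is simply the known site sharing the most shingles with the page. The names `shingles` and `web_unique_share` and the example texts are hypothetical, and the demo uses 3-word shingles only to keep the sample short.

```python
def shingles(text, size=14):
    """Return the set of overlapping word runs ("shingles") of the given length."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def web_unique_share(page_text, other_sites, size=14):
    """Estimate how much of a page is not found on other known sites.

    other_sites: dict mapping site name -> text already indexed from it.
    Returns (share of the page's shingles found nowhere else,
             name of the site sharing the most shingles, or None).
    """
    page = shingles(page_text, size)
    if not page:
        return 1.0, None  # page too short to form a single shingle
    # Shingles of this page that also occur on each known site.
    matches = {name: page & shingles(text, size)
               for name, text in other_sites.items()}
    copied = set().union(*matches.values())
    closest = max(matches, key=lambda name: len(matches[name]), default=None)
    if closest is not None and not matches[closest]:
        closest = None  # nothing overlaps, so there is no candidate owner
    return len(page - copied) / len(page), closest

# Hypothetical data: a scraped page that appends a few words to the original.
known = {
    "original.example": "the quick brown fox jumps over the lazy dog",
    "unrelated.example": "completely different words appear on this other page",
}
scraped = "the quick brown fox jumps over the lazy dog and then naps"
share, closest = web_unique_share(scraped, known, size=3)
print(share, closest)
```

A low share together with a clear closest site is the situation where the value could be passed on to that site instead of handing out a penalty.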

How's that for a thought? The original owner doesn't have to worry about duplicate content and doesn't have to fight people scraping their site, because their original content will just gain value when other people copy it. That would be a simple and scalable approach, no need to hand out penalties, the value is passed directly. The only real problem would be when the original owner does not have his content indexed yet -- ie a scraper finds great content which is not in the index yet (would be hard to find, if it's really "great"; how would you stumble upon it?)...

Does any of that make sense?

John

Centenarian Poster

Group: Members
Joined: 5-December 05
Posts: 121
From: UK
post Jan 5 2007, 11:59 AM
I seem to remember that Vanessa Fox covered aspects of duplicate content penalties during her recent interview with Rand on WebProNews.

I put a brief overview on our blog, with a link to the full video attached (admins, change this if you're not happy):

http://www.e-gain.co.uk/blog/googles-vanes...ues/2006/12/07/

Star Member

Group: 1000 Post Club
Joined: 29-December 05
Posts: 3,291
From: Novosibirsk, Russia
post Jan 6 2007, 03:33 AM
I believe the interview was conducted before there was a post on the Google Webmasters Blog, though.

Technical Administrator

Group: Technical Administrators
Joined: 8-March 06
Posts: 2,650
From: Minneapolis/Saint Paul, MN
post Jan 6 2007, 10:40 AM
QUOTE

How's that for a thought? The original owner doesn't have to worry about duplicate content and doesn't have to fight people scraping their site, because their original content will just gain value when other people copy it. That would be a simple and scalable approach, no need to hand out penalties, the value is passed directly. The only real problem would be when the original owner does not have his content indexed yet -- ie a scraper finds great content which is not in the index yet (would be hard to find, if it's really "great"; how would you stumble upon it?)...

Does any of that make sense?


Makes perfect sense to me! Only other problem I can see is identifying when the search engine has detected the wrong source as original - for example, if your content is syndicated by WebProNews and the algorithm determines that WebProNews must be the original because the site is older and more authoritative, although the content is identical.

Still, there are frequently some types of time stamp: date last modified, etc., and if your _page_ is evidently older you should be fine...
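As a toy illustration of that time stamp idea, the sketch below picks the copy with the earliest HTTP `Last-Modified` value. It assumes every copy exposes such a header and that the header can be trusted, which is a big assumption since a server can send any date it likes. The function name `likely_original` and the sites and dates are invented for the example.

```python
from email.utils import parsedate_to_datetime

def likely_original(last_modified_by_site):
    """Return the site whose Last-Modified header carries the earliest date.

    last_modified_by_site: dict mapping site name -> header value in the
    usual HTTP date format, e.g. "Wed, 03 Jan 2007 10:00:00 GMT".
    """
    parsed = {site: parsedate_to_datetime(value)
              for site, value in last_modified_by_site.items()}
    return min(parsed, key=parsed.get)

# Hypothetical copies of the same article.
copies = {
    "syndicator.example": "Fri, 05 Jan 2007 08:30:00 GMT",
    "myblog.example": "Wed, 03 Jan 2007 10:00:00 GMT",
}
source = likely_original(copies)
print(source)
```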

Anyhow, I'm just wandering now - I think your suggestion makes perfect sense!

Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
post Jan 6 2007, 11:08 AM
Yes, it's hard to find the original owner. But ... if you look back a bit into Google's origins you'll spot a reference to COPS on http://infolab.stanford.edu/~sergey/ (3 cheers for keeping old content online) - "Copy Detection Mechanisms for Digital Documents" .... sounds a lot like what would be needed :). There are some interesting names on those old documents.

While we're at it, who wants to join me for CS 349? Sounds interesting :D. The references for CS 345a also sound interesting.

John