Duplicate content is an interesting topic. Google's algorithms really don't seem to be so simple as to take a fixed percentage and use it as a threshold; there are always different factors that come together.
I think there is one thing that many people mix up -- there's in-site duplicate content (navigation, disclaimers, etc.) and there's cross-site duplicate content (content that comes from some other site). The in-site duplicates are easy to recognize and ignore once enough of the site is crawled.
If you remove the in-site duplication from a page, you're left with the "unique" part of the page (I'll call it "site-unique content", because I feel like it). If nothing is left (too much duplication within the same site), then that page might get ignored or devalued.
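Just to make that concrete -- this is only a toy sketch of the idea, not anything I know about how a search engine actually does it: you could strip in-site duplication by dropping the text blocks that repeat across many pages of the same site, and whatever is left per page is the "site-unique content". The function name, the threshold, and the input format here are all made up for illustration.

```python
from collections import Counter

# Hypothetical input: {url: list of text blocks on that page}.
def site_unique_content(pages, boilerplate_threshold=3):
    """Keep only the blocks of each page that do NOT repeat across the site.

    A block (navigation, disclaimer, footer, ...) that shows up on
    'boilerplate_threshold' or more pages is treated as in-site duplication
    and dropped; whatever remains is that page's "site-unique content"."""
    block_counts = Counter()
    for blocks in pages.values():
        block_counts.update(set(blocks))
    return {
        url: [b for b in blocks if block_counts[b] < boilerplate_threshold]
        for url, blocks in pages.items()
    }

# Example: the shared disclaimer disappears, each page's own text stays.
pages = {
    "/a": ["All rights reserved.", "Page A's actual article."],
    "/b": ["All rights reserved.", "Page B's actual article."],
    "/c": ["All rights reserved.", "Page C's actual article."],
}
print(site_unique_content(pages))
```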
If the site-unique part is cross-site duplication, then that would be another reason to ignore it. If too much of a site is cross-site duplicated (and it is recognized that the content originally comes from elsewhere), then that could raise a big red flag for a manual review or penalty.
Imagine the following situation:
"A friend" has a site that has - within the site-unique content - only 1% web-unique content and the rest comes from several known sites (easy to determine with a shingle-analysis), then that would look like something which is not really that useful for the general web population.
You could even go so far as to (gasp!) pass the value of that site to the sites where the content originally came from. Imagine that: you scrape ekstreme.com, and when it's noticed (automatically), ekstreme.com gets the PR and other value that you built up for your site. That would fit the general comments which Adam and Matt have made: why index an affiliate site when the original content owner could be pushed instead? It could even be done automatically. With long enough shingles (I have heard something like 14 words), it would be simple to find the closest owner of the full page content (assuming it's indexed).
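Here's roughly what that lookup might look like, again purely as my own toy sketch (the 14-word shingle size is just the rumored number, and the in-memory "index" stands in for what would really be a huge hashed shingle index): split pages into overlapping 14-word shingles and see which already-indexed site shares the largest number of a page's shingles -- that site would be the candidate "original owner" to pass the value to.

```python
def shingles(text, size=14):
    """Overlapping word shingles of the page text."""
    words = text.lower().split()
    if len(words) <= size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def closest_owner(page_text, index, size=14):
    """Return the indexed site sharing the most shingles with this page.

    'index' is a hypothetical mapping of site -> set of shingles already
    seen on that site's indexed pages."""
    page_shingles = shingles(page_text, size)
    best_site, best_overlap = None, 0
    for site, site_shingles in index.items():
        overlap = len(page_shingles & site_shingles)
        if overlap > best_overlap:
            best_site, best_overlap = site, overlap
    return best_site, best_overlap
```

If a scraped page's shingles mostly match ekstreme.com's, the PR and other value could (hypothetically) flow back to ekstreme.com instead of the scraper.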
How's that for a thought? The original owner doesn't have to worry about duplicate content and doesn't have to fight people scraping their site, because their original content just gains value when other people copy it. That would be a simple and scalable approach: no need to hand out penalties, the value is passed directly. The only real problem would be when the original owner doesn't have their content indexed yet -- i.e. a scraper finds great content which is not in the index yet (which would be hard to do if it's really "great"; how would you stumble upon it?)...
Does any of that make sense?