I've been looking at Google's duplicate text filter a little over the last few days, and I was wondering if anyone cared to share some notes on what they have experienced.
I have a site on blogspot, with a page rank of five (on the toolbar), which has been around for a little over three years. It has an Atom feed as an alternate way for people to see the content of the site.
The site would come up first in Google for its title, which isn't really all that exciting. But I noticed a couple of days ago that the front page of the blog no longer appears for that search, and seems to have been replaced by a bloglines display feed of the page, which shows the first 255 characters of each of the last twenty entries.
If I search for a string of text from the beginning of a recent entry, I can get the bloglines page to show (just that page by itself), accompanied by this phrase:
In order to show you the most relevant results, we have omitted some entries very similar to the 1 already displayed.
If you like, you can repeat the search with the omitted results included.
That includes a link which, when clicked upon, also shows a link to the front page of the blog and the single comment page for that post.
The bloglines display feed doesn't appear to have page rank (or at least it's somewhere below a one). I added the Atom feed around a year ago, so the bloglines feed is a couple of years younger than the blog.
Both are attached to popular "main" pages with the bloglines feed being part of bloglines, and the blogspot page being part of the blogger network.
Sources of Duplicate Content
I'm sure I probably missed a few, but here are some of the instances of duplicate text that I recall seeing on the web at one time or another.
[list]1. Public Domain stories and information.
2. Syndicated Columns that appear on more than one site
3. Licensed material that appears on more than one page
4. Mirrored sites
5. Plagiarized and copyright infringed content
6. RSS, RDF, and Atom Feeds
7. News Wire Stories
8. Blog Archive pages (where the blog posts are still on the front page, and are in archived pages, too.)
9. Newsletter archives which reproduce articles submitted by others
10. Fair use text
11. Manufacturer's product descriptions on retail sites
12. Editors' and producers' descriptions and reviews on retail sites for books, movies, and music.[/list]
Some Duplicate Content Questions and Comments
1. Except for many mirrored sites, most pages won't have exactly the same content. Usually there will be a different header, footer, and navigation system on pages that show the same body content. But, if there is a percentage of similarity that might trigger a duplicate content filter, what would that be?
2. Might the rest of the site be considered when a filter like this is put into place? If the site is a repository of original and licensed articles, does the reproduction of those articles by the authors elsewhere put this repository site at risk of having pages disappear from Google's index?
3. What determines why one page stays, and another disappears? I would have guessed that page rank might have made a difference. Or if it didn't, maybe seniority. But my experience with the blog above seems to rule that out.
4. What do you do if you find your site has disappeared from Google, and has been replaced by a site that infringed your copyright? (OK, I know what I would do, but I figured I would add this because it does seem like something that could happen.)
5. For everyone who has syndicated some articles, and has published copies of those articles on their own sites, what has your experience with duplicate content filters been?
6. Is there a risk of a duplicate content filter being applied to ecommerce pages that use licensed descriptions from the original manufacturers, and others who sell the product using the same descriptions?
7. Are there any other issues we should consider as we explore this topic?
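On question 1 above: how Google actually measures similarity isn't known here, but one commonly described way to put a number on "percentage of similarity" is to break each page into overlapping runs of words (shingles) and compare the two sets with a Jaccard score. The sketch below is only an illustration under that assumption. The function names, the four-word shingle size, and the sample page text are all made up for the example, and no particular threshold is implied.

```python
import re


def shingles(text, k=4):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one window: treat it as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(page_a, page_b, k=4):
    """Similarity between 0.0 (nothing shared) and 1.0 (identical shingle sets)."""
    return jaccard(shingles(page_a, k), shingles(page_b, k))


# Hypothetical pages: the same body text wrapped in different
# header/footer text, as with a blog front page and a feed display page.
body = ("I have been looking at how search engines filter "
        "duplicate text over the last few days")
blog_page = "My Blog Home Archives About " + body + " Posted by the author Comments"
feed_page = "Feed Reader Subscribe " + body + " Read more"

print(round(similarity(blog_page, feed_page), 2))
```

With a measure like this, the more unique text a page has around the shared body (navigation, sidebars, extra commentary), the lower the score against a copy, which is one way to think about why pages with the same article but different surroundings might or might not be treated as duplicates.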
My Duplicate Content Resolution
I could turn off my RSS feed. Or I could send a note to Google. But I'm not sure that I want to do either of those yet.
I'm going to give it at least a few weeks to see if it will straighten itself out. I'm really not that concerned about traffic from Google on that blog. The RSS feed to it is probably more important than whether or not the front page appears in Google. Pages other than the front page do appear in response to searches in Google.
I'm interested in hearing what others have to say. I may add more text to the page that isn't part of a blog post, to see if the addition of content will make it not appear to be a "duplicate" to the filter.