
Cre8asiteforums Internet Marketing
and Conversion Web Design



Tripping the Duplicate Content Filter


12 replies to this topic

#1 BillSlawski

BillSlawski

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15644 posts

Posted 10 February 2005 - 09:47 AM

Hi All,

I've been looking at Google's duplicate content filter a little over the last few days, and I was wondering if anyone cared to share notes on what they've experienced.

My Dilemma

I have a site on blogspot, with a page rank of five (on the toolbar), which has been around for a little over three years. It has an Atom feed as an alternate way for people to see the content of the site.

The site would come up first in Google for its title, which isn't really all that exciting. But I noticed a couple of days ago that the front page of the blog no longer appears for that search, and seems to have been replaced by a bloglines display feed of the page, which shows the first 255 characters of each of the last twenty entries.

If I search for a string of text from the beginning of a recent entry, I can get the bloglines page to show (just that page by itself), accompanied by this phrase:

In order to show you the most relevant results, we have omitted some entries very similar to the 1 already displayed. 
If you like, you can repeat the search with the omitted results included.


That includes a link which, when clicked upon, also shows a link to the front page of the blog and the single comment page for that post.

The bloglines display feed doesn't appear to have page rank (or at least it's somewhere below a one). I added the Atom feed around a year ago, so the bloglines feed is a couple of years younger than the blog.

Both are attached to popular "main" pages with the bloglines feed being part of bloglines, and the blogspot page being part of the blogger network.

Sources of Duplicate Content

I'm sure I probably missed a few, but here are some of the instances of duplicate text that I recall seeing on the web at one time or another.

1. Public domain stories and information
2. Syndicated columns that appear on more than one site
3. Licensed material that appears on more than one page
4. Mirrored sites
5. Plagiarized and copyright-infringed content
6. RSS, RDF, and Atom feeds
7. News wire stories
8. Blog archive pages (where the blog posts are still on the front page, and in archived pages, too)
9. Newsletter archives which reproduce articles submitted by others
10. Fair use text
11. Manufacturers' product descriptions on retail sites
12. Editors' and producers' descriptions and reviews on retail sites for books, movies, and music

Some Duplicate Content Questions and Comments

1. Except for many mirrored sites, most pages won't have exactly the same content. Usually there will be a different header, footer, and navigation system on pages that show the same body content. But if there is a percentage of similarity that might trigger a duplicate content filter, what would that be?

2. Might the rest of the site be considered when a filter like this is put into place? If the site is a repository of original and licensed articles, does the reproduction of those articles by the authors elsewhere put this repository site at risk of having pages disappear from Google's index?

3. What determines why one page stays and another disappears? I would have guessed that page rank might make a difference. Or if it didn't, maybe seniority. But my experience with the blog above seems to rule that out.

4. What do you do if you find your site has disappeared from Google, and has been replaced by a site that infringed your copyright? (OK, I know what I would do, but I figured I would add this because it does seem like something that could happen.)

5. For everyone who has syndicated some articles, and has published copies of those articles on their own sites, what has your experience with duplicate content filters been?

6. Is there a risk of a duplicate content filter being applied to ecommerce pages that use licensed descriptions from the original manufacturers, and others who sell the product using the same descriptions?

7. Are there any other issues we should consider as we explore this topic?

My Duplicate Content Resolution

I could turn off my RSS feed. Or I could send a note to Google. But I'm not sure that I want to do either of those yet.

I'm going to give it at least a few weeks to see if it will straighten itself out. I'm really not that concerned about traffic from Google on that blog. The RSS feed to it is probably more important than whether or not the front page appears in Google. Pages other than the front page do appear in response to searches in Google.

I'm interested in hearing what others have to say. I may add more text to the page that isn't part of a blog post, to see if the addition of content will make it not appear to be a "duplicate" to the filter.

Thanks.

#2 polarmate

polarmate

    Light Speed Member

  • Members
  • 513 posts

Posted 10 February 2005 - 11:00 AM

Bill, my blog on blogspot is no longer indexed. I saw the number of pages drop slowly. And now while Google 'knows of' the URL(s), they no longer carry any description nor are they cached. A link from blogwise shows when I do an inurl search and that has the Supplemental Result label.

I don't think I have duplicate content issues because there is hardly any traffic to the blog nor would anyone want to syndicate what I've written there ;) I don't think I am linking to any bad neighborhoods either.

I read on a forum that someone found a noindex, nofollow in the meta tags that Blogger inserts. Perhaps Google is dropping blogs from their index and will come up with some kind of blog search. I don't know. What I do know is that my blog no longer shows in the SERPs, and while it does not matter, it would be interesting to find out why.

#3 BillSlawski

BillSlawski

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15644 posts

Posted 11 February 2005 - 02:55 AM

Hi Polarmate,

Good news for you. If it's the blog in your signature, it does appear to still be in Google's index. :)

But, it's the individual comment post pages that seem to be disappearing on you. I wonder if that's part of a duplicate filter, too, since the same content is also included on a larger archive page, and you only have one post per month - so the content is almost exactly the same from monthly archive page to individual post page.

I liked your post on Glass instruments, by the way. This was one of the more interesting things that Benjamin Franklin invented.

I did check, and didn't see a noindex, nofollow meta tag inserted in my blogspot blog. I don't think that Google would do that. But thanks for the warning.
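For anyone else who wants to check their own template, the tag in question would look something like this in the page's head section (a generic example, not Blogger's actual markup):

```html
<head>
  <!-- If present, this tells search engines not to index the page
       or to follow its links -->
  <meta name="robots" content="noindex,nofollow">
</head>
```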

I've decided that I should contact Google to see if they could do something about the blog, but I'm still interested in the concept of duplicate content, and how Google treats it.

There are some interesting threads and articles on the subject of duplicate content. I'm going to point to some now, and see if I can find some more later (It's getting late).

For instance, Shari Thurow's article, Duplicate Content In The Search Engines, covers the situation where people use more than one domain name for the same site - to capture visitors who type a company name into a browser address bar. She recommends using a 301 redirect from the secondary domain name to the primary site.
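For anyone wanting to try that suggestion, a 301 from a secondary domain to the primary one can be done with a few lines in an Apache .htaccess file. This is just a sketch - the domain names are placeholders, and your host's setup may differ:

```apache
# Permanently redirect every request on the secondary domain
# to the same path on the primary domain
RewriteEngine On
RewriteCond %{HTTP_HOST} ^(www\.)?example-secondary\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```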

Another article, from problogger.com, covers when a site attempts to use RSS feeds to aggregate content on a specific topic from more than one blog, and how that can harm traffic to the original blog - see: RSS Abuse, Duplicate Content and Parasite Websites

A Wilson Web article quotes some thoughts from Mike Grehan on content that is shared by different divisions of the same company and reproduced on their web sites. See: Reusing Web Content without Getting Penalized. Mike provides some interesting observations there on legitimate reasons for companies sharing articles, and offers a couple of good suggestions, but ultimately suggests using a robots.txt file to keep duplicated content from being crawled, to avoid potential penalties.

This similar page checker was interesting:

http://www.webconfs....age-checker.php

I ran the homepage of my blog against the bloglines RSS feed, and it told me that the bloglines RSS display was 26.13% similar to the blog's homepage.
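A rough sketch of how a "percentage similar" score like that might be computed: Jaccard similarity over word shingles (here, three-word windows). This is my guess at the general technique, not necessarily what that checker actually does:

```python
# Compare two texts by the overlap of their word shingles (3-word windows).
# Jaccard similarity: shared shingles divided by total distinct shingles.

def shingles(text, k=3):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def percent_similar(a, b, k=3):
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 100.0
    return 100.0 * len(sa & sb) / len(sa | sb)

page = "the quick brown fox jumps over the lazy dog near the river bank"
feed = "the quick brown fox jumps over a sleeping dog near the river"
print(round(percent_similar(page, feed), 1))  # prints 40.0
```

Feeds that excerpt the first 255 characters of each post would share long runs of shingles with the blog's front page, which is why a score like 26% isn't surprising.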

Google's patent on Detecting duplicate and near-duplicate files makes for some good reading on the subject. Here's a snippet that I found interesting:

In the context of a search engine, the present invention may also be used during a crawling operation to speed up the crawling and to save bandwidth by not crawling near-duplicate Web pages or sites, as determined from documents uncovered in a previous crawl. Further, by reducing the number of Web pages or sites crawled, the present invention can be used to reduce storage requirements of downstream stored data structures. The present invention may also be used after the crawl such that if more than one document are near duplicates, then only one is indexed. The present invention can instead be used later, in response to a query, in which case a user is not annoyed with near-duplicate search results. The present invention may also be used to "fix" broken links. That is, if a document (e.g., a Web page) doesn't exist (at a particular location or URL) anymore, a link to a near-duplicate page can be provided.
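The fingerprinting idea behind that kind of near-duplicate detection can be sketched in a few lines. This is the well-known "simhash" style of approach, not necessarily the exact method the patent claims: hash each word to a bit pattern, sum votes per bit, and keep the sign. Near-duplicate documents end up with fingerprints that differ in only a few bits.

```python
# Simhash-style near-duplicate fingerprint (illustrative sketch only).
# Each word votes +1/-1 on each of 64 bits according to its hash; the
# fingerprint keeps a 1 wherever the votes are positive. Documents that
# share most of their words get fingerprints with a small Hamming distance.
import hashlib

def simhash(words, bits=64):
    totals = [0] * bits
    for w in words:
        h = int.from_bytes(hashlib.md5(w.encode()).digest()[:8], "big")
        for i in range(bits):
            totals[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if totals[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

doc1 = "tripping the duplicate content filter on a blogspot blog".split()
doc2 = "tripping the duplicate content filter on a blogger blog".split()
doc3 = "completely unrelated text about indian food recipes".split()

# Near-duplicates land close together; unrelated text lands far apart.
print(hamming(simhash(doc1), simhash(doc2)) < hamming(simhash(doc1), simhash(doc3)))
```

A crawler can compare fingerprints instead of full documents, which is what makes the "don't bother crawling near-duplicates" use in the patent snippet practical.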



#4 Grumpus

Grumpus

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 6327 posts

Posted 11 February 2005 - 04:55 AM

1. Except for many mirrored sites, most pages won't have exactly the same content. Usually there will be a different header, footer, and navigation system on pages that show the same body content. But if there is a percentage of similarity that might trigger a duplicate content filter, what would that be?


Google seems to like to break pages down into chunks. If your search term appears wholly or mostly in your navigation, header, or other common elements of your site, then the duplicate content filter can kick in at very low percentages - maybe even 10%.
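A toy illustration of that chunking idea (my own sketch, not Google's actual method): treat any line that appears on every page of a site as boilerplate, and compare only what's left. Once the shared header, footer, and navigation are stripped out, two pages that looked only partly similar can turn out to be near-identical in their bodies.

```python
# Toy "chunking" before duplicate comparison: lines common to all pages of
# a site are treated as boilerplate (header/footer/nav) and ignored, so
# only body content is compared. Illustrative only.

def body_lines(pages):
    """Strip lines common to all pages; return each page's remaining lines."""
    boilerplate = set.intersection(*(set(p.splitlines()) for p in pages))
    return [[ln for ln in p.splitlines() if ln not in boilerplate] for p in pages]

def overlap(a, b):
    """Fraction of page a's body lines that also appear in page b."""
    if not a:
        return 0.0
    return len(set(a) & set(b)) / len(set(a))

site = [
    "My Blog | Home\nPost 14: glass instruments\nPost 13: jeera chicken\nCopyright 2005",
    "My Blog | Home\nPost 14: glass instruments\nCopyright 2005",
    "My Blog | Home\nPost 12: something else entirely\nCopyright 2005",
]

bodies = body_lines(site)
print(overlap(bodies[1], bodies[0]))  # prints 1.0 - page 2's body is wholly inside page 1's
```

This is why an archive page and an individual post page can trip a filter even when their raw HTML is only modestly similar.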

Before going too deeply into the other questions, though, I'm going to suggest that your problem, Bill, may be something else, entirely. Let's start with this...

What comes up when you put in your blog name and a keyword from a specific article from your site rather than just the name alone?

G.

#5 BillSlawski

BillSlawski

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15644 posts

Posted 11 February 2005 - 08:01 AM

Hi G

Thanks. I'm interested in hearing your thoughts on this.

The duplicated content is the page title, tagline, post titles, author's name, post date, and the first 255 characters of each post. That's the type of information that the RSS feed carries.

The front page of the blog carries the posts from the last seven days of posting, and the bloglines feed carries the first parts of the last twenty posts.

That was worth trying. I did get a slightly different result.

I tried the title of the blog, and a person's name that shows up within the first 255 characters of a post.

The bloglines page comes up first, and the blog itself comes up 31st.

If I put quotation marks around the blog title, and try the search, the blog comes in third, with the bloglines feed first, and a different bloglines feed coming in second - one that has the blog's title in its list of blogs and the person's name in the body of a post.

Also, if I put just the blog's title in quotation marks, there's an indented bloglines feed that has the blog's name in it showing up second. The blog itself shows up about 150th or so in the results now.

#6 bwelford

bwelford

    Peacekeeper Administrator

  • Site Administrators
  • 9018 posts

Posted 11 February 2005 - 08:07 AM

This is fascinating. Just to make sure that I'm staying up with the discussion, I'd like to ask a question or two.

1. So duplicate content doesn't kick entries out of the Google database. They're still in there somewhere. Is that true?

2. Google has chosen to put summaries ahead of the originals, I guess in its search for relevancy. Is that true?

If both answers are Yes, aren't we into a discussion of why Google would have this particular definition of Relevancy that puts a summary ahead of the original.

Or have I got this all wrong? :?

#7 BillSlawski

BillSlawski

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15644 posts

Posted 11 February 2005 - 09:21 AM

1. So duplicate content doesn't kick entries out of the Google database. They're still in there somewhere. Is that true?


It appears that duplicate content does remain in Google's database, but may not show up in the results, or is given a discounted value in the results.

We're not sure that is the problem with my particular instance, but it looks like it might be.

2. Google has chosen to put summaries ahead of the originals, I guess in its search for relevancy. Is that true?


That's the easy conclusion to jump to, and I don't know if it is correct. That's part of the reason why I want to explore different aspects of the way that Google handles duplicate content in different situations.

I would hope that it isn't true. But, it is an aspect of the handling of duplicate content that I now seem to have a personal stake in. I don't want to turn off my Atom feed, yet the use of it appears to be causing me to have either a penalty, or (previously) to be filtered out completely in Google's results.

If I could, I would switch my Atom feed to display titles only, to see if that made things better. Unfortunately, Blogger doesn't offer that option.

I did send a quick note to Google last night about this, but I don't know if I will get a response. I may try a longer, more detailed one this weekend.

#8 Grumpus

Grumpus

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 6327 posts

Posted 11 February 2005 - 09:26 AM

1. So duplicate content doesn't kick entries out of the Google database. They're still in there somewhere. Is that true?


Nope. Never has. It shows the most relevant one and filters out the others. Relevance varies from term to term, so one may show for one search, and another may show for another.

Bill - I'm gonna suggest that this isn't so much a duplicate content filter (though I guess that's probably a part of it). Rather, the bloglines site is ranking better because of its authority status. In this thread, I discuss the importance of outbound links, but I also talk about what I've recently begun calling the 1-Click-Removed rule.

Basically, the 1CRR states that Google assumes that you are searching for a specific thing but it can't always determine what that specific thing is by your search term. So, it'll prefer to take you to a page that answers every (or the most possible) potential interpretation of the term.

Bill, in your case, your site has probably 5-10 articles on the front page. To get to article number 14 from your site, you'd need to make two clicks - first click to the Archive listing and second click to the actual article.

By going to the bloglines page, you can get to article 14 with a single click (since 20 articles along with some juicy text appears on the page).

Bloglines gets another boost for being an authority site (lots of inbound and outbound links) and likely it has "hub" power, too. So, even if their listing were identical to your page, it'd likely rank higher.

So, because the search term isn't specific enough to bring up a specific article, Google has decided that it's best to take you to the bloglines page. (Okay, I just went to your site to use some hard numbers.) You have six articles on your front page. The bloglines page has twenty. Thus, the bloglines page is roughly 3.3 times more likely to take a person to the article they were really looking for with a single additional click than your blog would be.

The concept has been around for a while - I remember first mentioning it over a year ago, but it's only recently that they really cranked up the weighting of it.

Does that make sense?

G.

#9 BillSlawski

BillSlawski

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15644 posts

Posted 11 February 2005 - 09:35 AM

Almost.

Thanks for putting forth that reasonable explanation.

It explains why the bloglines feed might rank higher than the page with a much higher page rank.

But it doesn't account for why the blog drops so far back in the results. The bloglines feed appears to get a boost, but the blog seems to get a penalty, too.

The solution does lie within my hands. That's to turn off the Atom feed. But the problem appears to be one of Google's making.

#10 Grumpus

Grumpus

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 6327 posts

Posted 11 February 2005 - 09:54 AM

The dupe content filter probably plays a role in this - the listing that you see for your site that's deeper in the SERPs is probably the SECOND most relevant page from your site.

Also - the keys here are "Hub" and "Authority". It appears that in this new update, Google cranked up the value of these sites (probably too much so, and we'll see them back off a bit in a while).

Remember too, that when Google starts to really employ something new (i.e. Going beyond just sprinkling it into the algo) it almost always tends to crank it up really high and then back down off it over the course of a few weeks or a month. I suspect that the primary reason for this is that it's easier to see what's wrong with something than it is to see what's right with something. So, it's easier to crank it up and fix what's wrong than it is to start with it as a low level factor and tweak the things that are working in an upwards direction.

G.

#11 polarmate

polarmate

    Light Speed Member

  • Members
  • 513 posts

Posted 16 February 2005 - 10:38 PM

Just catching up with this thread and trying to absorb the ideas presented thus far.

Bill, I was talking about my food blog: http://indianfoodrocks.blogspot.com - all pages were dropped, i.e. an inurl query showed only the URLs, no cache and no snippet. The blog was on the first page for "Indian Food Rocks" and then dropped out of sight. About 5 pages are back in the index now. Checking with the Google API shows that the blog still does not rank; however, a manual query in Google shows that it's back at #9.

I have the 'individual post' pages as well as the monthly archives. Considering that I am not as active as I would like to be, I probably should not link to the monthly archives nor include the previous posts' list, as I link to my recipes directly too. I should probably also reduce the number of posts that show on the homepage to see if this makes any difference. I am not convinced it was a dupe filter in my case. If it was, at least one of the pages would have shown up; all pages would not have been dropped. It also seems strange that Google would take this action because I used publishing options provided in Blogger.

I think what happened to my blog is not what is happening to yours.

#12 BillSlawski

BillSlawski

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15644 posts

Posted 18 February 2005 - 02:04 AM

I did check, and didn't see a noindex, nofollow meta tag inserted in my blogspot blog. I don't think that Google would do that. But thanks for the warning.


They changed comments a few days ago, and instead of using a redirect on all links in comments, they now use the nofollow value to define the rel attribute for anchor elements. Maybe that's what the person who mentioned "nofollow" was talking about?
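For anyone who hasn't seen it, the change looks something like this in the comment markup. This is a schematic example, not Blogger's actual template - the redirect URL shown is made up for illustration:

```html
<!-- Before: comment links went through a redirect (hypothetical URL) -->
<a href="http://www.blogger.com/redirect?u=http://example.com/">commenter</a>

<!-- After: links point directly at the target, but carry rel="nofollow"
     so search engines don't count them as endorsements -->
<a href="http://example.com/" rel="nofollow">commenter</a>
```

Either way, the link passes no value in Google, which may be what the "nofollow" comment was really about rather than a page-level noindex tag.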

Some great recipes, polarmate. A friend who I used to work with brought Indian food in for lunch every day, and used to share a little sometimes. Good stuff. ;)

I'm going to have to try out that Spicy Jeera Chicken.

There is something odd going on with your blog, too. I'm not sure it's completely unrelated. That they are both on blogspot is interesting.

If it is a duplicate filter problem, it may be an aspect that they didn't anticipate, or didn't think was enough of a risk: the duplication that happens on blogspot blogs among front page posts, archive pages, and individual post comment pages. Since the material is published in three places, and since posting frequency may be fairly low, those pages could share a lot of duplicate content.

I don't know if that is what you are experiencing, but I think I'll probably be digging into the subject in a lot more detail.

We are also discussing duplicate content here:

http://www.cre8asite...p=114573#114573

It will be interesting to see what Google does with my blog.

#13 polarmate

polarmate

    Light Speed Member

  • Members
  • 513 posts

Posted 18 February 2005 - 02:54 AM

Thank you for the kind words!! Jeera Chicken is what we're having for dinner tonight. It's my husband's birthday!

I'll follow through to the other thread and see if I can mine it for more insight into this.


