
Web Site Design, Usability, SEO & Marketing Discussion and Support


Duplicate Content Analysis

I couldn't find anything online that would let me do this, so I ended up making my own :(


Duplicate content detection through k-shingle analysis


I stumbled upon G-Man's post on this and decided to give it a test run.


In short, it takes the text from a web page, cuts it into overlapping pieces ("shingles") and then compares those pieces to the pieces from the other pages you specify.


The pieces are essentially the same ones you would use for multi-word keyword analysis. A text like

"Mary had a little lamb"

could be cut into shingles with 3 words each like so:

"mary had a"

"had a little"

"a little lamb"


By doing that, you can compare pages, determine how unique the content on them is, and easily spot related pages (or pages which use a lot of the exact same content). The larger the "shingle size" (the number of words grouped together), the closer two passages have to be to exact duplicates before they match, and of course the more text a page needs for the comparison to be meaningful.
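For anyone who wants to try the idea at home, here is a minimal Python sketch of the shingling and comparison steps. It is not the code behind the tool: the function names are made up for illustration, and the Jaccard ratio (shared shingles divided by all distinct shingles) is just one common way to score overlap, assumed here because the post does not say how the tool scores pages.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # A text with fewer than k words yields no shingles at all.
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Overlap score (assumed metric): shared shingles / all distinct shingles."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

page_a = shingles("Mary had a little lamb")
page_b = shingles("Mary had a little dog")
print(sorted(page_a))           # ['a little lamb', 'had a little', 'mary had a']
print(jaccard(page_a, page_b))  # 0.5
```

Raising `k` makes the match stricter: with `k=5` the two sample sentences above would share no shingles at all.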


Here's the tool: http://oy-oy.eu/page/shingles/


And an example with my site:



In that example you can easily spot the following:

- the page with the FAQ listing (page C) has the least amount of unique content

- it is related to the other pages with FAQ entries (pages D, E, F, G)

- the root page (page A) also has a significant amount of shared content (would make sense)


Here's another example, going to multiple sites with perhaps similar content:



Which pages are close duplicates? It's easy to tell. If we also include multiple pages from each of sites C and D, you can see which one is more "unique":



One of the reasons I put this together was a comment by Google's Adam Lasnik regarding the "+30" penalty, where he mentioned that your site should have "unique" content :). That's nothing new, but if this is the basis for a penalty, then it would be good to be able to check your pages in a way similar to what they might be doing. The link to Google's patent (which covers a bit more, especially with regard to scalability) is in G-Man's post (linked above).




Nice looking work.


One of the issues that I see here is that duplicates are looked for at every stage in the process of crawling, indexing, and serving documents.


For instance, while crawling, a search engine might want to try to detect mirrored sites, by looking at duplicate linking structures at different domains. It may also try to understand duplications on a single site when there are different URLs with similar text because the site has multiple URLs for the same pages (or extremely similar pages). Both types of detection of duplication may look to shingles for complete pages.


I would guess that at that point there may be a preference to let many duplicates be crawled and indexed, because it could be difficult to tell which page is the one that should be shown. One of the granted Google patents on duplicates, Detecting duplicate and near-duplicate files, may use the following types of signals in determining which documents to keep and which to eliminate: "the one with best PageRank, with best trust of host, that is the most recent."


But that's not the only patent from Google that describes a process for handling duplicates. Another looks at snippets instead of pages: Detecting query-specific duplicate documents. The abstract tells us:


An improved duplicate detection technique that uses query-relevant information to limit the portion(s) of documents to be compared for similarity is described. Before comparing two documents for similarity, the content of these documents may be condensed based on the query. In one embodiment, query-relevant information or text (also referred to as "snippets") is extracted from the documents and only the extracted snippets, rather than the entire documents, are compared for purposes of determining similarity.
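To make the snippet idea concrete, here is a rough Python sketch. It is only an illustration of comparing query-relevant text rather than whole documents: the patent abstract does not say how snippets are extracted or scored, so the fixed word window around each query term, the word-overlap score, and all function names are assumptions.

```python
import re

def words(text):
    """Lowercased word tokens with punctuation dropped."""
    return re.findall(r"[a-z0-9']+", text.lower())

def snippet(text, query, window=3):
    """Keep only the words within `window` positions of a query term (assumed rule)."""
    tokens = words(text)
    terms = set(words(query))
    keep = set()
    for i, tok in enumerate(tokens):
        if tok in terms:
            keep.update(range(max(0, i - window), min(len(tokens), i + window + 1)))
    return [tokens[i] for i in sorted(keep)]

def snippet_similarity(doc_a, doc_b, query, window=3):
    """Word overlap of the two snippets only; the rest of each page is ignored."""
    a = set(snippet(doc_a, query, window))
    b = set(snippet(doc_b, query, window))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

doc_a = "Header one. The quick brown fox jumps high. Footer A"
doc_b = "Other nav. A quick brown fox jumps far. Footer B"
print(snippet(doc_a, "fox", window=1))                    # ['brown', 'fox', 'jumps']
print(snippet_similarity(doc_a, doc_b, "fox", window=1))  # 1.0
```

The two sample pages differ in their headers and footers, yet for the query "fox" their snippets are identical, which is the kind of query-specific duplication the abstract describes.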


I presented on this topic at the recent Pubcon, along with Amanda Watlington, Yahoo's Tim Converse, and Google's Brian White. I focused on three areas of duplicate content: where duplicate content can be seen and possible ways to avoid it, an algorithm for detecting different URLs with similar content, and a recent patent application from Microsoft on collapsing equivalent results in search results pages. The Microsoft patent application is worth looking at in detail.


System and method for optimizing search results through equivalent results collapsing


I wrote about it in more detail here: Microsoft Explains Duplicate Content Results Filtering


A couple of the high points. Keep in mind that this is a patent application, and may not be what they are doing. Yet it seems very possible that they are following many of the ideas described in this document.


Microsoft will store information about what it believes are duplicates together, and choose one URL to display to viewers. The URL that it sends people to may be different from the one that it displays. (Remember Matt Cutts in his 301 redirect posts talking about displaying the "prettier" URL? Google might be doing something similar.)


A number of assumptions are made and followed in choosing which URL to show:


.com is preferred over .net


A country or language specific preference may be gleaned from location of searcher or browser settings, and a country based version of a URL may determine the page displayed.


There's more, but the patent focuses upon the actual filtering aspect of duplication, as opposed to the detection part, which is what your shingles demonstration helps show. Here's the Microsoft patent application on shingles:


Method for duplicate detection and suppression


Thanks for your comments, Bill. You were most definitely an encyclopedia in one of your "past lives" :P.


One thing I have been playing with in the past with regard to duplicate content is an analysis on an HTML block level: looking at the text in div's, p's, tables, etc. separately. Of course that only works when the page is parsable on a block level, which leads to my feeling that valid HTML code (at least on a block level) does make a difference :). I'm not sure if that is actually applied in the methods discussed in the various papers; it might just be ignored in order to save processing time.
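As a rough illustration of what I mean by block level, here is a small Python sketch using the standard library's `html.parser`. It is not the code I have been playing with: the `BlockText` class, the `text_blocks` helper and the list of tags treated as blocks are all just picked for the example.

```python
from html.parser import HTMLParser

# Tags treated as "blocks" here; this list is an assumption for the example.
BLOCK_TAGS = {"div", "p", "td", "li", "h1", "h2", "h3"}

class BlockText(HTMLParser):
    """Collect text per block element; text outside any block is ignored."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._stack = []  # one text buffer per currently open block tag

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append([])

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._stack:
            # Join the buffered pieces and collapse runs of whitespace.
            text = " ".join(" ".join(self._stack.pop()).split())
            if text:
                self.blocks.append(text)

    def handle_data(self, data):
        if self._stack:
            self._stack[-1].append(data)

def text_blocks(html):
    """Return the text of each closed block element, in closing order."""
    parser = BlockText()
    parser.feed(html)
    parser.close()
    return parser.blocks

print(text_blocks("<div><p>Hello world</p><p>Second   block</p></div>"))
# ['Hello world', 'Second block']
print(text_blocks("<p>one<p>two</p>"))
# ['two']
```

The second call shows the problem with code that is not valid on a block level: in this sketch the text of the unclosed first paragraph is simply lost. Each returned block could then be shingled and compared on its own.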


I find the results very interesting; there's a lot that could be done with shingles like that.


Here's one for you :) - http://oy-oy.eu/page/shingles/report.aspx?p=zOb36hvd1K


Your mention of the use of snippets instead of shingles sounds interesting as well. I feel that I've seen that in action in some places. A similar idea would be to use shingles to determine the common content across a site (navigation, etc) -- and then subtract those shingles from all pages, leaving the remaining ones to seed the rest of the process (snippet choice / generation, keywords, etc).
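A quick Python sketch of that subtraction idea, under my own assumptions: a shingle counts as "common" when it shows up on at least a given share of the pages (the `threshold` parameter is made up for this example), and `strip_sitewide` is a hypothetical helper, not part of the tool.

```python
import re
from collections import Counter

def shingles(text, k=3):
    """Set of k-word shingles (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def strip_sitewide(pages, k=3, threshold=0.8):
    """pages maps url -> text. Returns (unique shingles per url, common shingles).

    A shingle is treated as site-wide when it appears on at least
    `threshold` (a fraction, assumed) of the pages.
    """
    per_page = {url: shingles(text, k) for url, text in pages.items()}
    counts = Counter(s for page_set in per_page.values() for s in page_set)
    cutoff = threshold * len(per_page)
    common = {s for s, n in counts.items() if n >= cutoff}
    unique = {url: page_set - common for url, page_set in per_page.items()}
    return unique, common

pages = {
    "/a": "Home About Contact Us Mary had a lamb",
    "/b": "Home About Contact Us Jack and Jill went",
    "/c": "Home About Contact Us Humpty Dumpty sat",
}
unique, common = strip_sitewide(pages, threshold=1.0)
print(sorted(common))  # ['about contact us', 'home about contact']
print("mary had a" in unique["/a"])  # True
```

The shingles left in `unique` are the ones that would seed the rest of the process; shingles that straddle the navigation and the content (like "contact us mary") survive, so a real version would probably want some extra cleanup at the edges.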


Using this tool on a blog also shows you some interesting things: those who post full blog postings in the summary pages could be influencing their individual postings -- it's duplicate content (other than the comments :D) Which is worth more, the summary page (which probably has more PR) or a posting? Which is better for the visitor?



With few comments: http://oy-oy.eu/page/shingles/report.aspx?p=mDJFTdj20Q

With lots of comments: http://oy-oy.eu/page/shingles/report.aspx?p=em5z1AgHHJ




Amazing stuff, John. Those are some great tools.


Block-level analysis, whether the VIPS method used by Microsoft, the Visual Gap Segmentation described by Google, or IBM's described method of looking for "narrative" sections of a page, is a smart approach.


In Tim Converse's part of the presentation at Pubcon, he stressed that there were times when duplication was fine, such as in the reporting of news by wire services, or the use of snippets of text in the fair use quotation of materials. So, that's one obstacle that the search engines have to overcome when it comes to duplicates. Tim had a nice blog post a number of weeks back which he mentioned in his presentation - Aggregation.


You're right that processing time is an issue. I couldn't see a block level analysis being used unless it has a bigger payoff, such as helping the search engines understand and provide different values to links pointing out from a site (Microsoft's Block Level link Analysis, for instance).


Removing similar global content from across a site is an interesting approach, and while the block level approach would help with that, comparing snippets across pages would, too.


Using this tool on a blog also shows you some interesting things: those who post full blog postings in the summary pages could be influencing their individual postings -- it's duplicate content (other than the comments :D ) Which is worth more, the summary page (which probably has more PR) or a posting? Which is better for the visitor?


Agreed. That's one reason why I don't post full blog posts on the front page of my blog. I have seen Google show both the front page and the full post page in search results, with the full post page indented, when searching for a string of text that appears on both. That suggests an approach that collapses results based upon duplication of text on a full page rather than on a snippet. I don't know if that's common across the board, or just in certain situations.


Your examples of the full posts, without and with comments, should be similar to partial posts/full posts (without any comments) where the length of the full post is significant enough for there to be limited duplication.


There's a variation of duplicate document detection in Anna Patterson's approach to phrase based indexing for Google:


Phrase-based indexing in an information retrieval system


I like a lot of what I've seen of her approach, spread out over about five different patent applications. It wouldn't surprise me if Google was looking carefully at what she has done. Unfortunately, her methods would defeat Google bombing if incorporated fully, and I'm still seeing George Bush's homepage show up for the query miserable failure.
