Nice looking work.
One of the issues that I see here is that duplicates are looked for at every stage in the process of crawling, indexing, and serving documents.
For instance, while crawling, a search engine might want to try to detect mirrored sites by looking at duplicate linking structures at different domains. It may also try to understand duplication on a single site, where different URLs carry similar text because the site has multiple URLs for the same pages (or extremely similar pages). Both types of duplicate detection may rely on shingles computed over complete pages.
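To make the idea of page-level shingles concrete, here is a minimal sketch. It assumes word-based shingles of a fixed length and a plain Jaccard overlap between the two sets; the function names and the default of four words are my own choices for illustration, not anything taken from a search engine or a patent.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle (or none if the text is empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(text_a, text_b, k=4):
    """Jaccard similarity of the two shingle sets: shared / total distinct."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not a and not b:
        return 1.0  # two empty pages are treated as identical
    return len(a & b) / len(a | b)
```

A score near 1.0 suggests the two pages are near-duplicates, while a score near 0.0 suggests they share almost no runs of text. Real systems usually hash and sample the shingles rather than compare full sets.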
I would guess that at that point there may be a preference to let many duplicates be crawled and indexed, because it could be difficult to tell which page is the one that should be shown. One of the granted Google patents on duplicates,
Detecting duplicate and near-duplicate files, may use the following types of signals in determining which documents to keep and which to eliminate: "the one with best PageRank, with best trust of host, that is the most recent."
But that's not the only patent from Google that describes a process for handling duplicates. Another looks at snippets instead of pages:
Detecting query-specific duplicate documents. The abstract tells us:
An improved duplicate detection technique that uses query-relevant information to limit the portion(s) of documents to be compared for similarity is described. Before comparing two documents for similarity, the content of these documents may be condensed based on the query. In one embodiment, query-relevant information or text (also referred to as "snippets") is extracted from the documents and only the extracted snippets, rather than the entire documents, are compared for purposes of determining similarity.
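A rough sketch of what the abstract describes might look like the following. It assumes a "snippet" is just the words within a few positions of any query term, and that two snippets count as duplicates when their word sets overlap past a threshold. The window size, the threshold, and both function names are hypothetical; the patent does not specify them here.

```python
def snippet(text, query, window=3):
    """Collect the words within `window` positions of any query term."""
    words = text.lower().split()
    terms = set(query.lower().split())
    keep = set()
    for i, word in enumerate(words):
        if word in terms:
            keep.update(range(max(0, i - window), min(len(words), i + window + 1)))
    return [words[i] for i in sorted(keep)]


def query_duplicates(doc_a, doc_b, query, window=3, threshold=0.8):
    """Compare only the query-relevant snippets, not the whole documents."""
    a = set(snippet(doc_a, query, window))
    b = set(snippet(doc_b, query, window))
    if not a or not b:
        return False  # a document with no query-relevant text is never a match
    return len(a & b) / len(a | b) >= threshold
```

The point of the design is that two pages with different navigation, headers, and footers can still be treated as duplicates for a given query if the text around the query terms is the same. This sketch ignores punctuation and stemming, which a real implementation would need to handle.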
I presented on this topic at the recent Pubcon, along with Amanda Whatlington, Yahoo's Tim Converse, and Google's Brian White. I focused on three areas of duplicate content: where duplicate content can be seen and possible ways to avoid it, an algorithm for detecting different URLs with similar content, and a recent patent application from Microsoft that involves collapsing equivalent results in search results pages. The Microsoft patent application is worth looking at in detail.
System and method for optimizing search results through equivalent results collapsing. I wrote about it in more detail here:
Microsoft Explains Duplicate Content Results Filtering. A couple of the high points follow. Keep in mind that this is a patent application, and may not be what they are doing. Yet it seems very possible that they are following many of the ideas described in this document.
Microsoft will store information about what it believes are duplicates together, and choose one URL to display to viewers. The URL that it sends people to may be different from the one that it displays. (Remember Matt Cutts in his 301 redirect posts talking about displaying the "prettier" URL? Google might be doing something similar.)
A number of assumptions are made and followed in choosing which URL to show:
.com is preferred over .net
A country or language specific preference may be gleaned from location of searcher or browser settings, and a country based version of a URL may determine the page displayed.
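The two preferences above could be sketched as a simple sort key over a group of URLs already judged equivalent. This is only an illustration: the ordering of the rules (country match first, then .com over .net), the `TLD_RANK` table, and the function name are my assumptions, and the patent application lists more rules than these two.

```python
from urllib.parse import urlparse

# Hypothetical ranking; the only rule taken from the text is that .com beats .net.
TLD_RANK = {"com": 0, "net": 1}


def choose_display_url(equivalent_urls, searcher_country=None):
    """Pick one URL to show from a group already judged to be equivalent."""
    def sort_key(url):
        host = urlparse(url).hostname or ""
        tld = host.rsplit(".", 1)[-1]
        # Assumed ordering: a country-code TLD matching the searcher wins first.
        country_miss = 0 if searcher_country and tld == searcher_country.lower() else 1
        # Unknown TLDs rank after the listed ones; the URL itself breaks ties.
        return (country_miss, TLD_RANK.get(tld, len(TLD_RANK)), url)

    return min(equivalent_urls, key=sort_key)
```

For a group containing example.net, example.com, and example.de versions of the same page, this would show the .com URL by default and the .de URL to a searcher whose location or browser settings point to Germany.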
There's more, but the patent focuses upon the actual filtering aspect of duplication, as opposed to the detection part, which is what your shingles demonstration helps show. Here's the Microsoft patent application on shingles:
Method for duplicate detection and suppression