How's this for a definition:
When a user receives a manipulated document in the search results and clicks on the link to go to the manipulated document, the document is very often an advertisement for goods or services unrelated to the search query or a pornography website or the manipulated document automatically forwards the user on to a website unrelated to the user’s query.
That raises the question: what is a "manipulated document"?
Google was issued a new patent yesterday, titled "Methods and systems for identifying manipulated articles," which includes that definition.
I wrote a long post about it yesterday at my blog, but it really deserves to be discussed.
Before even beginning to look at the patent or my post about it, I'd recommend reading the following paper:
http://citeseer.ist....edu/213063.html
Choose the PDF version to read, and when the paper opens, scroll down to the boxed section on page 6, titled "trawling the web for emerging cyber-communities."
In that section, the authors describe a pattern of linking that is pretty interesting - communities often interlink among themselves in ways that tend to ignore a lot of the rest of the Web, and don't point to the "Authorities" and "Hubs" that you might talk about in something like HITS. These community-based interlinkings are referred to as "bipartite graphs." I think that section of the paper explains them pretty well.
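To make that linking pattern a little more concrete, here's a rough sketch in Python of how you might look for it in a small link graph: a group of "fan" pages that all link to the same set of "center" pages. This is only my illustration, not the method from the paper or the patent. The function name, the page names, and the `min_fans` and `num_centers` parameters are all made up for the example, and the brute-force search would not scale to anything Web-sized.

```python
from itertools import combinations


def find_bipartite_cores(links, min_fans=3, num_centers=2):
    """Return (fans, centers) pairs where every fan links to every center.

    `links` maps a page to the set of pages it links to.
    Hypothetical helper for illustration only.
    """
    # Invert the graph: for each target page, which pages link to it?
    inlinks = {}
    for source, targets in links.items():
        for target in targets:
            inlinks.setdefault(target, set()).add(source)

    cores = []
    # Try every group of `num_centers` linked-to pages (brute force).
    for centers in combinations(sorted(inlinks), num_centers):
        # Fans are the pages that link to all of the chosen centers.
        fans = set.intersection(*(inlinks[c] for c in centers))
        if len(fans) >= min_fans:
            cores.append((sorted(fans), list(centers)))
    return cores


# Toy link graph with invented page names.
links = {
    "fan1": {"siteA", "siteB"},
    "fan2": {"siteA", "siteB"},
    "fan3": {"siteA", "siteB", "other"},
    "loner": {"other"},
}

print(find_bipartite_cores(links))
# [(['fan1', 'fan2', 'fan3'], ['siteA', 'siteB'])]
```

Here fan1, fan2 and fan3 all point at both siteA and siteB, so they show up as one tightly linked group, while "loner" and "other" don't belong to it.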
Once you've read that, the newly granted Google patent makes more sense.
It's possible that not only do communities tend to link like that to each other, but link-based spam may also exhibit some of the same behavior - with interlinking among pages, and perhaps a number of links pointed into a cluster of those pages from outside, from places like guestbooks (the patent was originally filed in 2003 - I wonder whether a more recent filing would also mention blog comments alongside guestbook comments).
The patent tells us that clusters can be identified other ways, too. Once clusters are created, it looks at links pointing into the clusters as well as signals of "manipulation" within documents, such as machine generated text, heavy keyword repetition in meta tags, redirects, historical data about content and links and site ownership, and other information on pages and about pages.
Clusters that are identified can be examined to see if they have manipulative signals pointing towards them, and if documents within them show signs of being manipulated. Clusters that reach a certain threshold might be manually reviewed. Clusters that reach a higher threshold might instead be marked as manipulative outright.
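Here's a minimal sketch of how the two-threshold idea could work. The patent doesn't give a formula or numbers as far as I can tell, so the scoring function, the threshold values (0.5 and 0.8), and all of the names below are assumptions of mine, purely for illustration.

```python
def cluster_score(docs_with_signals, total_docs, suspect_inlinks, total_inlinks):
    """Hypothetical score between 0 and 1 for a cluster.

    Averages the share of documents showing manipulation signals with the
    share of inbound links that come from easily manipulated sources
    (guestbooks, for example). The weighting is an assumption.
    """
    doc_part = docs_with_signals / total_docs if total_docs else 0.0
    link_part = suspect_inlinks / total_inlinks if total_inlinks else 0.0
    return (doc_part + link_part) / 2


def classify_cluster(score, review_threshold=0.5, manipulated_threshold=0.8):
    """Map a score to an outcome using two assumed thresholds."""
    if score >= manipulated_threshold:
        return "manipulated"      # higher threshold: marked automatically
    if score >= review_threshold:
        return "manual_review"    # lower threshold: sent to a human
    return "ok"


print(classify_cluster(cluster_score(9, 10, 8, 10)))   # manipulated
print(classify_cluster(cluster_score(6, 10, 5, 10)))   # manual_review
print(classify_cluster(cluster_score(1, 10, 0, 10)))   # ok
```

The point is just that a cluster in the middle band goes to a person, and only a cluster with a lot of evidence against it gets labeled without review.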
A page within one of these clusters, or a subset of the cluster, that doesn't have any manipulative signals on it, but is clearly related to pages that do, may be treated in the same way as those other pages.
That treatment may involve having rankings reduced, being removed from a search index, or losing the ability to pass on something like PageRank.
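As a rough sketch of that "guilt by association" step, the snippet below applies one treatment to every page in a cluster that has been marked as manipulated, including pages with no signals of their own. The treatment names and the function are hypothetical. They only mirror the three outcomes listed above and say nothing about how Google actually implements them.

```python
# Assumed labels for the three outcomes described in the post.
TREATMENTS = {"reduce_ranking", "remove_from_index", "block_link_value"}


def apply_cluster_treatment(page_signals, cluster_label, treatment):
    """Return {page: treatment} for every page in a manipulated cluster.

    `page_signals` maps each page in the cluster to the list of
    manipulation signals found on it (possibly empty).
    """
    if treatment not in TREATMENTS:
        raise ValueError(f"unknown treatment: {treatment}")
    if cluster_label != "manipulated":
        return {}
    # Pages with no signals of their own are treated like the rest
    # of the cluster, because they are clearly related to pages that have them.
    return {page: treatment for page in page_signals}


cluster = {
    "page1": ["keyword_stuffed_meta", "redirect"],
    "page2": ["machine_generated_text"],
    "page3": [],  # clean on its own, but part of the cluster
}

result = apply_cluster_treatment(cluster, "manipulated", "block_link_value")
print(result["page3"])  # block_link_value
```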
I found the patent interesting because it seems to describe some things that we may have been seeing from Google for the past few years, and it presents some interesting approaches to fighting spam.
There have been some other papers and patents over the past couple of years that also provide some interesting approaches to fighting spam, from places like the AIRWeb (Adversarial Information Retrieval on the Web) workshops which I don't think that we've talked about much. You can find papers from those by following the "proceedings" links from each. Some pretty good ones in there.
Anyone want to talk about spam? Paid links?






