Combating Web Spam with TrustRank
The basic idea underlying the paper is that nonspam pages tend to link to other nonspam pages, and that tendency can be helpful in identifying trustworthy pages. The abstract for the paper reads:
Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be evaluated by an expert. Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good. In this paper we discuss possible ways to implement the seed selection and the discovery of good pages. We present results of experiments run on the World Wide Web indexed by AltaVista and evaluate the performance of our techniques. Our results show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.
A seed set of fewer than 200 sites seems pretty small when you consider how many documents are supposed to be in the different search engines' indices.
The paper ends by stating that TrustRank, by itself or together with PageRank, could be used to help sort spam sites from nonspam sites, and to serve searchers more trustworthy results.
Fine and good, but it doesn't describe how.
But a Yahoo patent application from this morning does talk about how that might work:
Link-based spam detection
The abstract tells us:
A computer implemented method of ranking search hits in a search result set. The computer-implemented method includes receiving a query from a user and generating a list of hits related to the query, where each of the hits has a relevance to the query, where the hits have one or more boosting linked documents pointing to the hits, and where the boosting linked documents affect the relevance of the hits to the query. The method associates a metric to each of at least a subset of the hits, the metric being representative of the number of boosting linked documents that point to each of at least a subset of the hits and which artificially inflate the relevance of the hits. The method then compares the metric, which is representative of the size of a spam farm pointing to the hit, with a threshold value, processes the list of hits to form a modified list based in part on the comparison, and transmits the modified list to the user.
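To make the steps in that abstract a little more concrete, here is a minimal Python sketch of the compare-and-modify part. It is only an illustration under my own assumptions: the abstract doesn't say how the metric is computed or exactly how the list is modified, so the `Hit` class, the `boost_metric` field, the threshold of 0.5, and the choice to demote (rather than remove) flagged hits are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    url: str
    relevance: float     # relevance of the hit to the query
    boost_metric: float  # hypothetical estimate of how much artificial boosting points at the hit


def modify_hits(hits, threshold, penalty=0.1):
    """Compare each hit's metric with a threshold and demote hits that exceed it.

    Demotion by a fixed penalty is an assumption for illustration; the abstract
    only says the list is processed "based in part on the comparison".
    """
    modified = []
    for hit in hits:
        if hit.boost_metric > threshold:
            # Looks like a large spam farm is pointing at this hit: scale its relevance down.
            modified.append(Hit(hit.url, hit.relevance * penalty, hit.boost_metric))
        else:
            modified.append(hit)
    # Re-rank the modified list by the adjusted relevance.
    return sorted(modified, key=lambda h: h.relevance, reverse=True)


hits = [
    Hit("spamfarm.example", 0.9, 0.8),
    Hit("honest.example", 0.7, 0.05),
    Hit("blog.example", 0.5, 0.2),
]
result = modify_hits(hits, threshold=0.5)
for hit in result:
    print(hit.url, round(hit.relevance, 2))
```

In this toy example the hit with the highest raw relevance drops to the bottom of the list, because its metric is above the threshold while the other two are below it.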
Some interesting tidbits of information in the patent application:
1. There's a related patent application, which hasn't been published yet, titled: "Automatic Updating of Trust Networks in Recommender Systems." It sounds a little like a way of getting more human judgment involved in ranking sites.
2. There's a nice definition section in the document, and the definitions are good ones. PageRank? Yahoo? What's Yahoo doing using PageRank? Chances are good that they have their own flavor of the algorithm to rank pages. Here are the definitions for PageRank and TrustRank:
[0019] PageRank is a family of well known algorithms for assigning numerical weights to hyperlinked documents (or web pages or web sites) indexed by a search engine. PageRank uses link information to assign global importance scores to documents on the web. The PageRank process has been patented and is described in U.S. Pat. No. 6,285,999. The PageRank of a document is a measure of the link-based popularity of a document on the Web.
[0020] TrustRank is a link analysis technique related to PageRank. TrustRank is a method for separating reputable, good pages on the Web from web spam. TrustRank is based on the presumption that good documents on the Web seldom link to spam. TrustRank involves two steps, one of seed selection and another of score propagation. The TrustRank of a document is a measure of the likelihood that the document is a reputable (i.e., a nonspam) document.
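The two definitions can be sketched side by side in a few lines of Python. This is a toy illustration, not the patent's or any search engine's implementation: it treats TrustRank as a PageRank variant whose random jumps land only on hand-picked good seeds, which is how I read the paper's score propagation step. The tiny link graph, the damping factor of 0.85, the iteration count, and all function names are my own assumptions, and seed selection is reduced to naming one page by hand.

```python
def biased_pagerank(links, teleport, damping=0.85, iterations=50):
    """Power iteration where random jumps follow the `teleport` distribution."""
    nodes = list(links)
    scores = dict(teleport)
    for _ in range(iterations):
        new = {n: (1 - damping) * teleport[n] for n in nodes}
        for src in nodes:
            outs = links[src]
            if outs:
                # Split the page's score evenly among the pages it links to.
                share = damping * scores[src] / len(outs)
                for dst in outs:
                    new[dst] += share
            else:
                # Page with no outlinks: hand its score back via the teleport distribution.
                for n in nodes:
                    new[n] += damping * scores[src] * teleport[n]
        scores = new
    return scores


def pagerank(links, **kwargs):
    # Uniform jumps: every page is an equally likely starting point.
    uniform = {page: 1 / len(links) for page in links}
    return biased_pagerank(links, uniform, **kwargs)


def trustrank(links, good_seeds, **kwargs):
    # Jumps land only on the manually reviewed good seeds, so trust starts there.
    seeded = {page: (1 / len(good_seeds) if page in good_seeds else 0.0) for page in links}
    return biased_pagerank(links, seeded, **kwargs)


# Hypothetical web: three good pages that link to each other, and a small spam farm.
links = {
    "seed": ["good_a", "good_b"],
    "good_a": ["good_b", "seed"],
    "good_b": ["seed"],
    "spam_hub": ["spam_target"],
    "spam_1": ["spam_target"],
    "spam_2": ["spam_target"],
    "spam_target": ["spam_hub"],
}

pr = pagerank(links)
trust = trustrank(links, good_seeds={"seed"})
for page in links:
    print(f"{page:12s} pagerank={pr[page]:.3f} trustrank={trust[page]:.3f}")
```

In this graph the spam farm pushes its target page to a higher PageRank than any of the good pages, but because no good page links into the farm, none of the seed's trust ever reaches it. A real web graph is messier, since good pages do occasionally link to spam.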
So, do you think that Yahoo! is using this method of detecting web spam?






