Moderator Alumni
Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
May 4 2006, 07:14 AM
One of the papers on search that I've heard mentioned a lot in the last year or two is one dealing with TrustRank:

Combating Web Spam with TrustRank

The basic idea underlying the paper is that nonspam pages tend to link to nonspam pages, and that this can be helpful in identifying trustworthy pages. The abstract for the paper reads:

QUOTE
Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be evaluated by an expert. Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good. In this paper we discuss possible ways to implement the seed selection and the discovery of good pages. We present results of experiments run on the World Wide Web indexed by AltaVista and evaluate the performance of our techniques. Our results show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.

A seed set of fewer than 200 sites seems pretty small when you consider how many documents are supposed to be in the different search engines' indices.

The paper ends by stating that TrustRank, by itself or together with PageRank, could be used to help separate spam sites from nonspam sites and to serve searchers more trustworthy results. Fine and good, but it doesn't describe how.

But a Yahoo patent application from this morning does talk about how that might work:

Link-based spam detection

The abstract tells us:

QUOTE
A computer implemented method of ranking search hits in a search result set.
The computer-implemented method includes receiving a query from a user and generating a list of hits related to the query, where each of the hits has a relevance to the query, where the hits have one or more boosting linked documents pointing to the hits, and where the boosting linked documents affect the relevance of the hits to the query. The method associates a metric to each of at least a subset of the hits, the metric being representative of the number of boosting linked documents that point to each of at least a subset of the hits and which artificially inflate the relevance of the hits. The method then compares the metric, which is representative of the size of a spam farm pointing to the hit, with a threshold value, processes the list of hits to form a modified list based in part on the comparison, and transmits the modified list to the user.

Some interesting tidbits of information in the patent application:

1. There's a related patent application, which hasn't been published yet, titled "Automatic Updating of Trust Networks in Recommender Systems." It sounds a little like a way of getting more human judgment involved in ranking sites.

2. There's a nice definition section in the document, and the definitions are good ones. PageRank? Yahoo? What's Yahoo doing using PageRank? Chances are good that they have their own flavor of the algorithm to rank pages.

Here are the definitions for PageRank and TrustRank:

QUOTE
[0019] PageRank is a family of well known algorithms for assigning numerical weights to hyperlinked documents (or web pages or web sites) indexed by a search engine. PageRank uses link information to assign global importance scores to documents on the web. The PageRank process has been patented and is described in U.S. Pat. No. 6,285,999. The PageRank of a document is a measure of the link-based popularity of a document on the Web.

[0020] TrustRank is a link analysis technique related to PageRank.
TrustRank is a method for separating reputable, good pages on the Web from web spam. TrustRank is based on the presumption that good documents on the Web seldom link to spam. TrustRank involves two steps, one of seed selection and another of score propagation. The TrustRank of a document is a measure of the likelihood that the document is a reputable (i.e., a nonspam) document.

So, do you think that Yahoo! is using this method of detecting web spam?
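
For anyone who wants to see the "seed selection, then score propagation" idea concretely, here is a minimal Python sketch. It assumes the seed set has already been picked by hand, and it propagates trust with a PageRank-like iteration whose random jump goes only to the seeds. The function name `trustrank`, the damping value of 0.85, the iteration count, and the toy link graph are all my own illustrative choices, not something taken from the paper or the patent application.

```python
def trustrank(links, seeds, damping=0.85, iterations=50):
    """Propagate trust from hand-picked seed pages along outgoing links.

    links: dict mapping a page to the list of pages it links to.
    seeds: non-empty set of pages a human reviewer judged to be reputable.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}

    # Step 1 (seed selection, assumed done by hand): trust is spread evenly
    # over the seeds; every other page starts with no trust at all.
    seed_share = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(seed_share)

    # Step 2 (score propagation): each page splits its trust among the pages
    # it links to, and a fixed fraction is re-injected at the seeds each round.
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) * seed_share[p] for p in pages}
        for src, targets in links.items():
            if not targets:
                continue  # pages without outlinks pass nothing on
            share = damping * trust[src] / len(targets)
            for t in targets:
                nxt[t] += share
        trust = nxt
    return trust


# Hypothetical toy web: three good pages that link among themselves, and two
# spam pages that link to each other and out to a good page.
links = {
    "good_a": ["good_b"],
    "good_b": ["good_a", "good_c"],
    "good_c": ["good_a"],
    "spam_x": ["spam_y", "good_a"],
    "spam_y": ["spam_x"],
}
scores = trustrank(links, seeds={"good_a"})
for page in sorted(scores, key=scores.get, reverse=True):
    print(f"{page}: {scores[page]:.3f}")
```

In this toy graph no good page links to the spam pages, so no trust ever reaches them, which is the presumption in paragraph [0020] at work. On the real web some good pages do link to spam, so the scores would be a likelihood signal rather than a clean split.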
May 4 2006, 08:08 AM
I should add another paper that we've talked about a little here, too.

Three of the four authors of Link Spam Detection Based on Mass Estimation have their names on the patent application, and the Mass Estimation concept is also included in the USPTO filing. These documents mesh well together.
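
As a rough illustration of how mass estimation and the threshold test from the patent abstract might fit together, here is a Python sketch. It assumes, as I read the mass estimation paper, that a page's relative spam mass is the share of its PageRank that cannot be traced back to a known-good core of pages. The names `pagerank`, `relative_spam_mass`, and `rerank`, the 0.5 threshold, the choice to demote rather than drop flagged hits, and the toy graph are all hypothetical choices of mine, not details from the filing.

```python
def pagerank(links, jump, damping=0.85, iterations=50):
    """Linear PageRank with an arbitrary random-jump distribution `jump`."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: jump.get(p, 0.0) for p in pages}
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) * jump.get(p, 0.0) for p in pages}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    nxt[t] += share
        rank = nxt
    return rank


def relative_spam_mass(links, good_core):
    """Estimate the fraction of each page's PageRank not explained by the good core."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    # Ordinary PageRank: the random jump can land on any page.
    full = pagerank(links, {p: 1.0 / n for p in pages})
    # Core-based PageRank: the random jump lands only on known-good pages.
    core = pagerank(links, {p: 1.0 / n for p in good_core})
    return {p: 1.0 - core[p] / full[p] for p in pages}


def rerank(hits, mass, threshold=0.5):
    """Move hits whose estimated spam mass exceeds the threshold to the end."""
    kept = [h for h in hits if mass[h] <= threshold]
    flagged = [h for h in hits if mass[h] > threshold]
    return kept + flagged


# Hypothetical toy web: a small good cluster plus a three-page spam farm that
# exists only to boost "target".
links = {
    "good_a": ["good_b", "good_c"],
    "good_b": ["good_a"],
    "good_c": ["good_a"],
    "farm_1": ["target"],
    "farm_2": ["target"],
    "farm_3": ["target"],
    "target": ["farm_1", "farm_2", "farm_3"],
}
mass = relative_spam_mass(links, good_core={"good_a", "good_b"})
print(rerank(["target", "good_c", "good_a"], mass))
```

Here "target" gets all of its PageRank from the farm, so its estimated mass sits at the maximum and it falls to the bottom of the list, while "good_c" stays under the threshold even though it is not in the core itself. Whether the filing demotes, removes, or otherwise reprocesses such hits is not something the abstract spells out.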
May 4 2006, 02:19 PM
QUOTE
What is PageRank doing in a Yahoo! paper?

That's a good question. It may not be essential for them to use PageRank if they could use something similar but different enough not to be covered by the PageRank patent. I've been wondering about that since I saw the TrustRank paper and noticed that one of the authors was listed as working at Yahoo!

Off the top of my head, I thought of this one, which does some interesting things but might not work well with TrustRank:

Method and apparatus for ranking web page search results

It does some sorting of pages on its own:

QUOTE
The invention accommodates external, subjective or objective judgment regarding the quality of a page in relation to it content or the number of linkages included in the page that are likely to be useful. The judgments are represented in attractor matrices to indicate desirable or "high quality" sites, while non-attractor matrices indicate sites that are undesirable. Attractor matrices and non-attractor matrices can be used alone or in combination with each other in the linear combination. Additional bias toward high quality sites, or away from undesirable sites, can be further introduced with probability weighting matrices for attractor and non-attractor matrices.

Nice observation:

QUOTE
Even if you can get an algorithm to create a list of 200 seed pages for every 10 million web pages, you will have a seriously long list to work through. I bet they are already working on making the algo work better, perhaps by incorporating other concept.

I agree with you here. They do talk about an automated way of trying to find seed pages towards the very end of the patent application. It's done so quickly that it's hard to follow without going back to the TrustRank paper, comparing the two, and seeing how they might work together.

QUOTE
Again on the "let humans decide" theme, only Yahoo, as far as I know, manually edits its SERPs.
It seems like that's one of the core values of Yahoo, going all the way back to its days as a directory rather than a search engine. But even with a system that is completely automated, you do get some subjective choices that will influence and bias search results. See the "Method and apparatus" patent I mentioned above, which treats the decision to call a page low quality as at least partly a subjective one:

QUOTE
The invention accommodates external, subjective or objective judgment regarding the quality of a page in relation to it content or the number of linkages included in the page that are likely to be useful. The judgments are represented in attractor matrices to indicate desirable or "high quality" sites, while non-attractor matrices indicate sites that are undesirable. Attractor matrices and non-attractor matrices can be used alone or in combination with each other in the linear combination. Additional bias toward high quality sites, or away from undesirable sites, can be further introduced with probability weighting matrices for attractor and non-attractor matrices.