Leading Community for Usability, Search Engine Marketing,
Social Networking, Site Planning & Web Site Development, Since 1998


Trustrank + Pagerank = - Spam


6 replies to this topic

#1 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 04 May 2006 - 07:14 AM

One of the papers on search that I've heard mentioned a lot in the last year or two is one dealing with Trustrank:

Combating Web Spam with TrustRank

The basic idea underlying the paper is that nonspam pages tend to link to nonspam pages, and that this tendency can be used to identify trustworthy pages. The abstract for the paper reads:

Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine's results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be evaluated by an expert. Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good. In this paper we discuss possible ways to implement the seed selection and the discovery of good pages. We present results of experiments run on the World Wide Web indexed by AltaVista and evaluate the performance of our techniques. Our results show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.



A seed set of fewer than 200 sites seems pretty small when you consider how many documents are supposed to be in the different search engines' indices.

The paper ends by stating that trustrank, by itself or together with pagerank, could be used to help sort spam sites from nonspam sites and serve searchers more trustworthy results.

Fine and good, but it doesn't describe how.

But a Yahoo patent application published this morning does talk about how that might work:

Link-based spam detection

The abstract tells us:

A computer implemented method of ranking search hits in a search result set. The computer-implemented method includes receiving a query from a user and generating a list of hits related to the query, where each of the hits has a relevance to the query, where the hits have one or more boosting linked documents pointing to the hits, and where the boosting linked documents affect the relevance of the hits to the query. The method associates a metric to each of at least a subset of the hits, the metric being representative of the number of boosting linked documents that point to each of at least a subset of the hits and which artificially inflate the relevance of the hits. The method then compares the metric, which is representative of the size of a spam farm pointing to the hit, with a threshold value, processes the list of hits to form a modified list based in part on the comparison, and transmits the modified list to the user.
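The comparison step in that abstract can be pictured with a minimal sketch. Everything here is an assumption for illustration: the `Hit` class, the `spam_farm_metric` field, the `rerank` function and the `penalty` parameter are hypothetical names, and the abstract does not say whether hits over the threshold are dropped or merely demoted.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    url: str
    relevance: float         # relevance of the hit to the query
    spam_farm_metric: float  # hypothetical estimate of boosting by a spam farm


def rerank(hits, threshold, penalty=0.0):
    """Form a modified hit list by comparing each metric with a threshold.

    Assumption: a hit over the threshold is dropped when penalty is 0.0,
    otherwise its relevance is multiplied by penalty (demotion).
    """
    modified = []
    for hit in hits:
        if hit.spam_farm_metric > threshold:
            if penalty == 0.0:
                continue  # treat the hit as spam and leave it out
            hit = Hit(hit.url, hit.relevance * penalty, hit.spam_farm_metric)
        modified.append(hit)
    # Most relevant hits first, as in an ordinary result list.
    return sorted(modified, key=lambda h: h.relevance, reverse=True)
```

The interesting part, how the metric itself is computed, is not covered by this sketch.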



Some interesting tidbits of information in the patent application:

1. There's a related patent application, which hasn't been published yet, titled: "Automatic Updating of Trust Networks in Recommender Systems." It sounds a little like a way of getting more human judgment involved in ranking sites.

2. There's a nice definition section in the document, and the definitions are good ones. Pagerank? Yahoo? What's Yahoo doing using pagerank? Chances are good that they have their own flavor of the algorithm to rank pages. Here are the definitions for pagerank and trustrank:

[0019] PageRank is a family of well known algorithms for assigning numerical weights to hyperlinked documents (or web pages or web sites) indexed by a search engine. PageRank uses link information to assign global importance scores to documents on the web. The PageRank process has been patented and is described in U.S. Pat. No. 6,285,999. The PageRank of a document is a measure of the link-based popularity of a document on the Web.

[0020] TrustRank is a link analysis technique related to PageRank. TrustRank is a method for separating reputable, good pages on the Web from web spam. TrustRank is based on the presumption that good documents on the Web seldom link to spam. TrustRank involves two steps, one of seed selection and another of score propagation. The TrustRank of a document is a measure of the likelihood that the document is a reputable (i.e., a nonspam) document.
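The score propagation step in that definition can be sketched roughly as a PageRank-style iteration whose random jumps go only to the seed pages. This is a sketch under stated assumptions, not the paper's or Yahoo's implementation: the function name `trustrank`, the damping value and the iteration count are illustrative choices, and the seed set is simply handed in rather than selected algorithmically.

```python
def trustrank(out_links, seeds, damping=0.85, iterations=20):
    """Propagate trust from hand-picked seed pages along outgoing links.

    out_links maps each page to the list of pages it links to.
    seeds is the (non-empty) set of pages a human judged to be good.
    damping and iterations are assumed values for illustration.
    """
    if not seeds:
        raise ValueError("at least one seed page is required")
    pages = set(out_links) | {t for ts in out_links.values() for t in ts}
    # Jump distribution: all of the initial trust sits on the seed pages.
    jump = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(jump)
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) * jump[p] for p in pages}
        for src, targets in out_links.items():
            if not targets:
                continue  # a page with no links passes on nothing
            share = damping * trust[src] / len(targets)
            for t in targets:
                nxt[t] += share  # trust is split evenly among outlinks
        trust = nxt
    return trust
```

Pages that no seed can reach end up with a score of zero, which is the "good pages seldom link to spam" presumption at work.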



So, do you think that Yahoo! is using this method of detecting web spam?

#2 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 04 May 2006 - 08:08 AM

I should add another paper that we've talked about a little here, too.

Three of the four authors of Link Spam Detection Based on Mass Estimation have their names on the patent application, and the Mass Estimation concept is also included in the USPTO filing.

Taken together, these documents mesh well.
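As a loose illustration of how a mass estimate might tie the two ideas together, the sketch below compares ordinary PageRank with a PageRank whose random jumps land only on a trusted core, and treats the share of a page's score that the core cannot account for as its "relative spam mass". This is a simplified reading of the idea, not the authors' exact formulation: the names `biased_pagerank` and `relative_spam_mass`, the damping value, the iteration count and the unscaled core jump vector are all assumptions.

```python
def biased_pagerank(out_links, jump, damping=0.85, iterations=50):
    """Linear PageRank-style scores for a given random-jump vector."""
    scores = dict(jump)
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) * jump[p] for p in jump}
        for src, targets in out_links.items():
            if not targets:
                continue  # pages without outlinks pass on nothing
            share = damping * scores[src] / len(targets)
            for t in targets:
                nxt[t] += share
        scores = nxt
    return scores


def relative_spam_mass(out_links, good_core):
    """Fraction of each page's score not explained by the trusted core."""
    pages = set(out_links) | {t for ts in out_links.values() for t in ts}
    n = len(pages)
    uniform = {p: 1.0 / n for p in pages}
    core_only = {p: (1.0 / n if p in good_core else 0.0) for p in pages}
    p_all = biased_pagerank(out_links, uniform)
    p_core = biased_pagerank(out_links, core_only)
    return {p: (p_all[p] - p_core[p]) / p_all[p] for p in pages}
```

A page whose score comes mostly from pages outside the core gets a value close to 1, and that value could then be compared against a threshold of the kind the patent abstract describes.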

#3 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 04 May 2006 - 08:55 AM

Interesting stuff. I have a few more comments:

1. The TrustRank idea is simple enough to understand and produces a decent algo. The algorithm is not perfect (in the example they use, spammy page 5 got a good score), but it's a start. The upshot is that they told us how to circumvent it.

2. They always go back to getting a list of pages to be checked manually by humans. That's fine and dandy, but as the web grows larger, the list will grow too - massively. Even if you can get an algorithm to create a list of 200 seed pages for every 10 million web pages, you will have a seriously long list to work through. I bet they are already working on making the algo work better, perhaps by incorporating other concepts. One thing they mention in their introduction is URLs, and that's certainly something that can have telltale signs of spam.

3. Again on the "let humans decide" theme, only Yahoo, as far as I know, manually edits its SERPs. Google "prides" itself (for lack of a better word) on all its SERPs being computer generated - they only remove pages, and don't manually boost them. That's why I wasn't surprised this was a Yahoo! project.

4. What is PageRank doing in a Yahoo! paper? They state there is a relationship between PR and TrustRank and how they perform as spam checkers.

Interesting stuff to read :D

#4 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 04 May 2006 - 02:19 PM

What is PageRank doing in a Yahoo! paper?


That's a good question.

It may not be essential for them to use pagerank if they could use something similar but different enough not to be covered under the pagerank patent. I've been wondering about that since I saw the trustrank paper and noticed that one of the authors was listed as working at Yahoo!

Off the top of my head, I thought of this one, which does some interesting things but might not work with trustrank well:

Method and apparatus for ranking web page search results

It does some sorting of pages on its own:

The invention accommodates external, subjective or objective judgment regarding the quality of a page in relation to it content or the number of linkages included in the page that are likely to be useful. The judgments are represented in attractor matrices to indicate desirable or "high quality" sites, while non-attractor matrices indicate sites that are undesirable. Attractor matrices and non-attractor matrices can be used alone or in combination with each other in the linear combination. Additional bias toward high quality sites, or away from undesirable sites, can be further introduced with probability weighting matrices for attractor and non-attractor matrices.




Nice observation:

Even if you can get an algorithm to create a list of 200 seed pages for every 10 million web pages, you will have a seriously long list to work through. I bet they are already working on making the algo work better, perhaps by incorporating other concepts.


I agree with you here. They do talk about an automated way of trying to find seed pages towards the very end of the patent application. It's done so quickly that it's hard to follow without going back to the trustrank paper and trying to compare the two, and see how they might work together.


Again on the "let humans decide" theme, only Yahoo, as far as I know, manually edits its SERPs.



It seems like that's one of the core values of Yahoo, going all the way back to its days as a directory rather than a search engine. But even with a system that is completely automated, you do get some subjective choices that will influence and bias search results. See the "Method and apparatus" patent I mentioned above, where the inventor describes the decision to label a page as low quality as partly a subjective one:

The invention accommodates external, subjective or objective judgment regarding the quality of a page in relation to it content or the number of linkages included in the page that are likely to be useful. The judgments are represented in attractor matrices to indicate desirable or "high quality" sites, while non-attractor matrices indicate sites that are undesirable. Attractor matrices and non-attractor matrices can be used alone or in combination with each other in the linear combination. Additional bias toward high quality sites, or away from undesirable sites, can be further introduced with probability weighting matrices for attractor and non-attractor matrices.



#5 eKstreme

eKstreme

    Hall of Fame

  • 1000 Post Club
  • 3399 posts

Posted 04 May 2006 - 03:08 PM

I accept that Yahoo! is using a PageRank-like algo that is not covered by a patent, but my concern is that they are using a Google trademark. PageRank is a Google-specific name. Personally, I would be happier if they called it something else...

On a related note: isn't Stanford like the biggest non-corporate Google research center? Weren't the addresses in the Yahoo paper from Stanford too? Conflict of interest anyone? How about cross-pollination of commercially sensitive ideas?

#6 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 04 May 2006 - 03:49 PM

Using the pagerank name is, I guess, sort of essential. I mean, whatever it is that they are using that is similar to pagerank would be one of their trade secrets, and I bet they are more comfortable with keeping it that way.

There are a number of patent filings that aren't from Google that use the term pagerank. I guess one of the risks of being popular is becoming the example that others are willing to talk about.


On a related note: isn't Stanford like the biggest non-corporate Google research center?


A lot of the academic work that comes out of it seems to become somehow affiliated with Google. But, it is an independent school, and while it owns the two patents that describe pagerank, it licenses those out to Google. The professors who work there are free to work with other companies. It's probably not a bad thing to be able to teach classes, and do a little contracting with a Google or a Yahoo.

There's a footnote on the front page of the Mass Estimation paper that I linked to which notes that one of its authors, Zoltan Gyongyi, was a summer intern at Yahoo when he worked on the paper. He's also listed on the patent application, so I guess his work flowed over from both documents to the patent filing.

#7 bwelford

bwelford

    Eyes Like Hawk Moderator

  • Moderators
  • 8894 posts
  • Twitter:http://twitter.com/BWelford
  • Facebook:http://www.facebook.com/bwelford

Posted 05 May 2006 - 11:51 AM

I find it slightly amusing that Trustrank is also a Google trademark as of March 16, 2005. So Yahoo will have to name it something else if it becomes part of their approach.

I prefer the Yahoo term inlink to Google's back link, so there is a precedent for coming along with a better name for an earlier concept. :D

Edited by bwelford, 05 May 2006 - 11:53 AM.