> Google, Manipulation, And Web Spam, new patent, paid links, and penalties

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Nov 28 2007, 06:54 PM
It's been a while since we've talked about Web spam here.

How's this for a definition:

QUOTE
When a user receives a manipulated document in the search results and clicks on the link to go to the manipulated document, the document is very often an advertisement for goods or services unrelated to the search query or a pornography website or the manipulated document automatically forwards the user on to a website unrelated to the user’s query.


That raises the question, what is a "manipulated document?"

Google was issued a new patent yesterday titled Methods and systems for identifying manipulated articles, which includes that definition.

I wrote a long post about it yesterday at my blog, but it really deserves to be discussed.

Before even beginning to look at the patent or my post about it, I'd recommend going to the following paper, at:

http://citeseer.ist.psu.edu/213063.html

Choose the PDF version to read, and when the paper opens, scroll down to page 6, which is in a box, and is titled "trawling the web for emerging cyber-communities."

In that section, the authors describe a pattern of linking that is pretty interesting - communities can often interlink amongst themselves in ways that tend to ignore a lot of the rest of the Web, and don't point to the "Authorities" and "Hubs" that you might talk about in something like HITS. These community based interlinkings are referred to as "bipartite graphs." I think that section of that paper explains those pretty well.
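To make the linking pattern concrete, here is a minimal sketch of spotting small groups of pages that all link to the same set of targets. It isn't the paper's trawling algorithm or anything from the patent; the function name, the `fans` and `centers` parameters, and the example hostnames are all made up for illustration, and the brute-force search would only work on a tiny link graph.

```python
from itertools import combinations

def find_bipartite_cores(links, fans=2, centers=2):
    """Find groups of `fans` pages that all link to at least `centers` shared targets.

    `links` maps each page to the set of pages it links to.
    Brute force over all combinations, so only suitable for small graphs.
    """
    cores = []
    for group in combinations(sorted(links), fans):
        # Targets that every page in the group links to.
        shared = set.intersection(*(links[page] for page in group))
        shared -= set(group)  # ignore links among the fans themselves
        if len(shared) >= centers:
            cores.append((group, sorted(shared)))
    return cores

# Hypothetical link graph: a and b both point at x and y.
links = {
    "a.example": {"x.example", "y.example", "z.example"},
    "b.example": {"x.example", "y.example"},
    "c.example": {"y.example", "news.example"},
}
cores = find_bipartite_cores(links)
```

With that sample graph, the only group found is a.example and b.example, which both link to x.example and y.example.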

Once you've read that, the newly granted Google patent makes more sense.

It's possible that not only do communities tend to link to each other like that, but link-based spam may also exhibit some of the same behavior: interlinking amongst pages, and perhaps a number of links pointed into a cluster of those pages from outside, from places like guestbooks and blog comments (the patent was originally filed in 2003; I wonder whether, had it been filed more recently, it would also mention blog comments alongside guestbook comments).

The patent tells us that clusters can be identified other ways, too. Once clusters are created, it looks at links pointing into the clusters as well as signals of "manipulation" within documents, such as machine generated text, heavy keyword repetition in meta tags, redirects, historical data about content and links and site ownership, and other information on pages and about pages.

Clusters that are identified can be examined to see if they have manipulative signals pointing towards them, and if documents within them show signs of being manipulated. Clusters that reach a certain threshold might be manually reviewed. Clusters that reach a higher threshold might instead be marked as manipulative.
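The two-threshold idea can be sketched like this. The patent doesn't give numbers or say how per-document signals are combined, so the threshold values, the use of a simple average, and the names here are all assumptions on my part.

```python
# Hypothetical cutoffs; the patent does not publish actual values.
REVIEW_THRESHOLD = 0.5
MANIPULATED_THRESHOLD = 0.8

def classify_cluster(doc_scores):
    """Classify a cluster from per-document manipulation scores in [0, 1].

    Averaging the scores is an assumption made for this sketch.
    """
    if not doc_scores:
        return "ok"
    cluster_score = sum(doc_scores) / len(doc_scores)
    if cluster_score >= MANIPULATED_THRESHOLD:
        return "manipulated"    # higher threshold: marked automatically
    if cluster_score >= REVIEW_THRESHOLD:
        return "manual_review"  # lower threshold: queued for a human
    return "ok"
```

For example, `classify_cluster([0.6, 0.5, 0.7])` lands in the manual review band, while `classify_cluster([0.9, 0.9, 0.7])` crosses the higher cutoff.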

A page within one of these clusters, or a subset of the cluster, that doesn't have any manipulative signals on it, but is clearly related to pages that do, may be treated in the same way as those other pages.
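Here's a rough sketch of that "guilt by association" step, where a clean page inherits the treatment of the cluster it belongs to. The `min_flagged_share` cutoff and the function name are hypothetical; the patent doesn't say how many pages in a cluster need to show signals.

```python
def mark_cluster(cluster_pages, page_signals, min_flagged_share=0.5):
    """Return {page: manipulation_indicator} for every page in a cluster.

    `page_signals` maps a page to the list of manipulative signals found on it.
    If enough pages are flagged, every page in the cluster gets the indicator,
    including pages that show no signals themselves.
    """
    flagged = [page for page in cluster_pages if page_signals.get(page)]
    share = len(flagged) / len(cluster_pages) if cluster_pages else 0.0
    cluster_is_manipulated = share >= min_flagged_share
    return {page: cluster_is_manipulated for page in cluster_pages}

cluster = ["p1.example", "p2.example", "p3.example"]
signals = {
    "p1.example": ["redirect"],
    "p2.example": ["keyword repetition in meta tags"],
    "p3.example": [],  # no signals of its own
}
indicators = mark_cluster(cluster, signals)
```

In this example p3.example ends up marked along with the other two, even though nothing was found on it.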

That treatment may involve having rankings reduced, being removed from a search index, or losing the ability to pass on something like PageRank.

I found the patent interesting because it seems to describe some things that we may have been seeing from Google for the past few years, and it presents some interesting approaches to fighting spam.

There have been some other papers and patents over the past couple of years that also provide some interesting approaches to fighting spam, from places like the AIRWeb (Adversarial Information Retrieval on the Web) workshops which I don't think that we've talked about much. You can find papers from those by following the "proceedings" links from each. Some pretty good ones in there.

Anyone want to talk about spam? Paid links?

Emoticons Detective

Group: Moderators
Joined: 12-May 04
Posts: 3,199
From: Glen Ellen, Ca.
post Nov 28 2007, 07:32 PM
Yes, Forbes does this a lot on the Google News page. I found another company doing it yesterday, but I'm having a hard time finding it right now.

I'll look for it.

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Nov 28 2007, 09:16 PM
That's a terrific example, Donna.

On the ads, those kinds of interstitial advertisements are a pain, aren't they? From a usability perspective, they interrupt the task you're trying to accomplish.

Imagine walking into a store, and before you can get to what you wanted to buy, a salesclerk comes up to you and starts telling you what you should buy. It's not a sales tactic that I'm really pleased with.

I'd call that a manipulative signal. :)

Forbes also doesn't seem to link out too much to sites on the web besides other Forbes properties, and advertisements.

They have a nice little link farm at the bottom of their pages that goes to places like forbestravel.com, forbesautos.com, and sites like investopia.com that have similar link farms listed at the bottom of their pages. So, they have one of these "bipartite graphs" going on too, where they all sort of link to each other, but rarely link out to anywhere else.

Forbes is one of the sites that recently saw a drop in its toolbar PageRank.

Moderator

Group: Moderators
Joined: 27-July 05
Posts: 2,936
post Nov 28 2007, 09:31 PM
They call those interstitials "Welcome Screens," and if you do not click through to the article, they redirect to the homepage in about 60 seconds.

Star Member

Group: 1000 Post Club
Joined: 28-April 03
Posts: 1,489
From: UK
post Nov 28 2007, 09:33 PM
QUOTE
When a user receives a manipulated document in the search results and clicks on the link to go to the manipulated document, the document is very often an advertisement for goods or services unrelated to the search query or a pornography website or the manipulated document automatically forwards the user on to a website unrelated to the user’s query.


What if the user was searching for pornography in the first place? That quote reads to me as if they are saying that any search result leading to pornography is manipulated, even if the user was searching for it.

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Nov 28 2007, 10:10 PM
Surprisingly, Ken, that's addressed in the patent.

Towards the bottom of the patent, they explain the kinds of things that they might do to pages within a cluster if they think that it has a high manipulation signal, such as lowering the rankings of those pages, removing them from the index, or not letting them pass along PageRank.

But, check the sentences that I highlighted below:


QUOTE
A manipulation indicator can be associated with every document in a cluster or subset of the cluster determined to be manipulated.

This manipulation indicator can then be used during the retrieval and ranking phase by the search engine 120 in a variety of ways.

For example, a manipulation indicator can be used in a ranking function to lower the rank of a document.

Alternatively, a manipulation indicator can be used as an indication that the document should be removed entirely from the search results.

Additionally, a manipulation indicator can be used to treat the document differently, such as not using the document in a hyperlink structure-based ranking calculation, such as PageRank™ from Google, Inc.

Further, a manipulation indicator can be used depending on the query. For example, if the query relates to pornography, the manipulation indicator may not be used.

Manipulated indicators can be used in a variety of other ways during the retrieval and ranking processes.


So, someone actually looking for pornography might still be shown pages that would otherwise be flagged for tricking visitors into arriving at pornography.
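As a rough sketch of how an indicator like that might be used at ranking time, under my own assumptions: the function name, the `penalty` multiplier, and the `query_exempt` flag are hypothetical, and the patent only says the indicator can lower a rank, remove a document, or be skipped depending on the query.

```python
def apply_indicator(results, manipulated, query_exempt=False, penalty=0.5, remove=False):
    """Re-rank (url, score) results using a set of manipulated URLs.

    query_exempt: skip the indicator entirely for this query
                  (for instance, a query that relates to pornography).
    penalty:      hypothetical multiplier applied to a flagged document's score.
    remove:       drop flagged documents instead of demoting them.
    """
    adjusted = []
    for url, score in results:
        if not query_exempt and url in manipulated:
            if remove:
                continue      # removed entirely from the results
            score *= penalty  # demoted in the ranking
        adjusted.append((url, score))
    return sorted(adjusted, key=lambda item: item[1], reverse=True)

results = [("spam.example", 0.9), ("good.example", 0.6)]
manipulated = {"spam.example"}

demoted = apply_indicator(results, manipulated)
removed = apply_indicator(results, manipulated, remove=True)
exempt = apply_indicator(results, manipulated, query_exempt=True)
```

In the first call the flagged page drops below the clean one, in the second it disappears, and in the third the original order is left alone because the indicator isn't used for that query.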