
Cre8asiteforums: Internet Marketing and Conversion Web Design



Truth, Lies and PageRank Tape



#1 cre8pc


    Dream Catcher Forums Founder

  • Admin - Top Level
  • 13454 posts

Posted 14 February 2005 - 12:28 PM

So much has changed with Google since the first papers analyzing how it works and how PageRank is calculated came out. Those papers were pored over by SEOs and used to help them in their work.

One paper has been thrust back into discussion because the author has noticed changes. His comments sparked a response by Michael Martinez in:

A rebuttal of Phil Craven's "Google Explained"

Michael Martinez writes:

What people in the SEO community have now convinced themselves of is that a page "bleeds" PageRank, which is utter nonsense.


Remember, Google wants to approximate user behavior. They don't just want to create an algorithm that is in conflict with reality.


His analysis is flawed, as are all the others he refers to, and many more that I have read. The chief problem with all these analytical papers is that they assume or arbitrate a closed system to preserve a PR average of 1.0. Google isn't doing that. The Web is an open system, not a closed system. Hence, any closed-system model will diverge from Google's practical application.

There are other significant problems with these analyses, but I will emphasize one point that Google has made on more than one occasion (in their own words): the Toolbar PageRank has no direct correlation to the link popularity PageRank, and all these analyses continue to draw upon or try to resolve to that fallacious assumption.

A site's overall importance is not gauged simply by how many pages link to it. That was a nonsense assumption that Page and Brin relied upon in their first model, but they got burned very badly by the link farmers. All subsequent determinations of "importance" have incorporated other factors.


It's a long post, but well worth reading. He covers a lot of ground, and he will likely take some heat for it from the authors of the papers he's referring to.

#2 Black_Knight


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9339 posts

Posted 14 February 2005 - 01:25 PM

Some reading to do before I can say much more than "looks interesting".

It looks like there are some flaws in the criticism, though they may be matters of semantics rather than real disagreements. For one, the comment about "bleeding" PageRank is an area where there is commonly misunderstanding over what is meant, but little disagreement once the misunderstanding is removed. For another, there sounds like a similarly simple misunderstanding over the 'importance' of the pages on which links appear, and the corresponding difference that makes to the importance of the links themselves.

#3 Black_Knight


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9339 posts

Posted 14 February 2005 - 01:52 PM

The chief problem with all these analytical papers is that they assume or arbitrate a closed system to preserve a PR average of 1.0. Google isn't doing that.

If Martinez can show that Google aren't doing that he'll be famous.

However, he seems to miss what that average is about. It is a normalization value that is absolutely essential, and it is the entire reason the iterative link calculations can work at all. Convergence on the average value of 1 is the point at which the iterative calculations can stop.

The value of 1 helps ensure that there is not more total link popularity than there are links to create it. It means that on average, across the web as a whole, a link is worth 1 link, rather than sitting on an endless scale where a single link has no fixed value, which would make all the other calculations valueless. It would seem that Martinez is no mathematician.

#4 Black_Knight


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9339 posts

Posted 14 February 2005 - 02:10 PM

The damping factor is a terminating probability -- that is, that any given page will terminate the random surfer's trip across the Web. But notice that they suggest the damping factor need not be applied to all pages. It was, at the time this paper was first analyzed in the SEO community, assumed that no damping factor was applied to Yahoo! That assumption should, because of Google's current market share and general reputation, be transferred to Google. It should nonetheless still be assumed that Yahoo! has a minimal damping factor.

Again, there's a disconnect in understanding here.

PageRank models the 'random walk' approach, as the papers make clear. The damping factor isn't about this 'random surfer' abandoning a page so much as it models that he'll type in a URL he knows rather than click a link. In the original papers they use a figure of 15 percent (d set to 0.85), equating precisely to a 15% chance that this random surfer will not click any link on the page, but will instead stop following links and type in a URL he knows.
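To make that concrete, here is a rough Python sketch of that random-surfer model (a made-up toy graph, purely my own illustration, nothing from the papers or from Google): with probability 0.85 the surfer follows a random link on the current page, otherwise he "gets bored" and types in some other URL.

import random

# Toy link graph: page -> pages it links to (invented purely for illustration)
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "C"],   # nothing links to D; the surfer only lands here via a random jump
}

pages = list(links)
visits = {p: 0 for p in pages}
d = 0.85                 # chance of following a link; the other 15% is the "type a URL" jump
current = random.choice(pages)

for _ in range(100000):
    visits[current] += 1
    if random.random() < d and links[current]:
        current = random.choice(links[current])   # click a random link on the page
    else:
        current = random.choice(pages)            # get bored and jump to a known URL

total = sum(visits.values())
for p in pages:
    print(p, round(visits[p] / total, 3))         # share of time spent on each page

The share of time the surfer spends on each page is the quantity the PageRank calculation approximates analytically.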

Brin and Page did not suggest that the damping factor need not be applied to all pages, as Mr Martinez erroneously asserts. Instead, they suggested that a numeric damping factor need not be used at all, and that one could use a page (kinda like using a comparison filter) as the damping factor, which would allow for personalization, perhaps even localization or topical-filtering.

I never heard that some people figured Yahoo had no damping factor, but if they did, I'd have ignored them as not understanding the theory or the papers properly.

#5 Michael_Martinez


    Time Traveler Member

  • 1000 Post Club
  • 1354 posts

Posted 24 February 2005 - 02:01 AM

From the original Page/Brin paper, "The Anatomy of a Large-Scale Hypertextual Web Search Engine":

2.1.1 Description of PageRank Calculation
Academic citation literature has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows:  

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:  

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))  

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.


Emphasis is mine. PageRank is a probability distribution: if the PageRanks of N pages sum to 1, then the average PageRank is 1/N. That means it can never average to 1 (except in a collection with only 1 document).

Let's use some round numbers to demonstrate what they are talking about. Google currently indexes over 8,000,000,000 pages, so we'll just work with 8,000,000,000.

An unadjusted, evenly distributed PageRank for any randomly selected document would start out as 1 divided by 8,000,000,000. That is an EXTREMELY SMALL number. You then process the iterations, counting all the links and applying the standard damping factor of 0.85 (not allowing for any arbitrary determinations of importance), and you'll never come up with an average of 1.0.

Try it with 10 documents. They all start out with a PageRank of 0.1. Assume they all link to one other document. You're still not going to get an average of 1. It's mathematically impossible.

So, no, there is no disconnect in what I say there at all. I'm just going by what Messrs. Page and Brin have to say on the subject, and they are the only published authorities. I'll take their word for it.

Of course, they also say that the damping factor "can be set between 0 and 1. We usually set it to 0.85". That was then, of course, but "usually" implies "sometimes we set it to something else". Setting the damping factor to something OTHER than .85 can adjust a document's calculated PageRank.

It still won't produce an average PageRank of 1.0 across the database.

Nowhere did I assert that the damping factor would not or need not be applied to all documents. All I have pointed out is that the damping factor is adjustable and that there is no basis for assuming that it has never been adjusted for anything.

However, Page and Brin DID say:

...And, the d damping factor is the probability at each page the "random surfer" will get bored and request another random page. One important variation is to only add the damping factor d to a single page, or a group of pages....


So, in fact, the idea that a damping factor might NOT be applied to some pages comes straight from them, not from me. I call it a "terminating probability", you say something about typing in another URL. The difference in language is trivial. We are saying the same thing.

A great deal of nonsense concerning PageRank has been passed around for years. It's probably never going to be correctly understood by anyone outside of Google's staff, because only they know how it is currently implemented.

But I have seen enough analyses of PageRank which confuse the ToolBar PageRank (measured from 0 to 10) with the link popularity PageRank (which is always between 0 and 1, not inclusive of 1) to know that most of the self-styled experts who comment on it don't know what they are talking about.

There is no average of 1 to preserve or converge toward. That is mathematically impossible.

#6 bwelford


    Peacekeeper Administrator

  • Site Administrators
  • 9003 posts

Posted 24 February 2005 - 05:35 AM

Welcome to the Forums, Michael_Martinez. :wave:

I have seen enough analyses of PageRank which confuse the ToolBar PageRank (measured from 0 to 10) with the link popularity PageRank (which is always between 0 and 1, not inclusive of 1)

Spot on, Michael. Of course, if not even the inventors of PageRank seem to be able to avoid this confusion, it's not surprising that the rest of us forget which we're talking about.

For myself, I think the whole PageRank concept has by now lost its usefulness as a barometer of a given web page. It's about as good as the barometer people used to have on the wall in the entrance of the house, say back in the 1940s. You'd tap it as you left the house in the morning: if it moved up, you were likely in for good weather (anticyclone); if it moved down, the weather would likely get worse (depression). That's about as useful as the ToolBar PageRank is nowadays.

Originally it was a great measure of a web page, provided you didn't tell anyone about it. Once the cat was out of the bag, everyone tried to add web pages and links to beat the system. So by now, it's guaranteed that a single PageRank number is not part of the Google algorithm. Backlinks are still important to Google (perhaps too important), but there's no simple summary measure. I'm sure backlinks are also important to Microsoft Search and Yahoo! as well.

It's still a useful branding element for Google, so it won't go away. However, IMHO it's more like the Nike swoosh now: instant brand recognition, but beyond that, don't try to read anything into it.

#7 bwelford


    Peacekeeper Administrator

  • Site Administrators
  • 9003 posts

Posted 24 February 2005 - 05:40 PM

I see Mike Grehan seems to be talking with others who say PageRank is no longer a part of the Google algorithm.

#8 cre8pc


    Dream Catcher Forums Founder

  • Admin - Top Level
  • 13454 posts

Posted 24 February 2005 - 08:54 PM

Welcome to the forums Michael! :wave:

I'm very happy you found us and this thread about your writing, and responded. My hat's off to you and anyone who has the patience for all that math! :)

#9 Michael_Martinez


    Time Traveler Member

  • 1000 Post Club
  • 1354 posts

Posted 25 February 2005 - 01:52 AM

Thanks for the welcome. I don't want to seem like a one-hit wonder, but I am already stretched thin on forum discussions (and have had to sacrifice some non-SEO related discussions to do what I am doing now).

I have bookmarked this forum and will drop by from time to time. I'll try not to be too argumentative, as my positions on certain topics are generally well-known.

I do tend to emphasize the fundamentals of page design over everything else, not because of any belief that nothing else works, but because I feel fundamentals are not stressed enough any more.

#10 bragadocchio


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 25 February 2005 - 06:43 AM

HI Michael,

Glad that you stopped by and shared your thoughts with us.

We'll look forward to future visits, when you have the chance.

#11 whitemark


    Time Traveler Member

  • 1000 Post Club
  • 1071 posts

Posted 25 February 2005 - 02:21 PM

What I am unable to understand is why, apart from links, nobody seems to be trying to (or hasn't been able to) figure out what other factors Google uses to calculate PageRank. Perhaps because many of the PageRank papers floating around only seem to mention links ... ?

#12 Michael_Martinez


    Time Traveler Member

  • 1000 Post Club
  • 1354 posts

Posted 26 February 2005 - 12:02 AM

By definition, PageRank only applies to the links. I think what you're referring to as PageRank is their actual search results ranking (or ordering) algorithm. The link popularity is a tool for discriminating between otherwise equal results.

However, there ARE people out there trying to figure out what else Google is using to order the results.

We are all equally ignorant on the matter, having virtually no reliable information to work with.

We know (from Google's own guidelines) that title tags, headers, and on-page content are used to determine relevance. A degree of relevance to the query is determined by the query tool and that is used to order results. If they need further ordering, then the link popularity may be (and most probably is) used.

#13 Michael_Martinez


    Time Traveler Member

  • 1000 Post Club
  • 1354 posts

Posted 26 February 2005 - 12:27 AM

I should probably also point out that I was a little loose with my numbers. I attempted an iteration by hand in another discussion, using a variation of the 10 document example I suggested above. I only checked my numbers a little bit. For example, my reference to a "1 in 10 chance" is probably a little off (maybe it's closer to 1 in 20). Here is what I posted (as far as Classic PageRank is concerned):


A unique search term can be devised and used as the title for three pages (A..C). Let any group of ten other pages (D..M) link to one of the three pages. Have five pages link to page A, three pages link to page B, and two pages link to page C.

Pages A, B, and C should have unique content (say, 1 paragraph). But they should all have the same title tag, and all ten anchor texts on pages D - M should use the same keyword.

Now, all other things being equal, the link popularity should sort the pages in order A, B, C.

Once that sorting has been completed, change the content of four of the five pages which link to page A by adding a link to both page B and page C in those four pages.

After the next indexing, page A should now move down below page C.

Once this has been confirmed, add a link to page C onto the three pages linking only to page B.

After the next indexing, page C SHOULD now move up to the number 1 slot.

Here is the Classic PageRank calculation:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

C(Tn) = the number of links on page Tn.

PR(Tn) starts at 1 divided by the number of documents in the collection (a probability of 1 divided evenly among all documents).

d is assumed to be 0.85 as suggested in the original PageRank paper.

Since there is no link reciprocation, we only need the first iteration of the PageRank algorithm.

P(A) = (1-0.85) * (0.85) * (5 * (1/13)).

P(B) = (1-0.85) * (0.85) * (3 * (1/13)).

P(C) = (1-0.85) * (0.85) * (2 * (1/13)).

According to my Windows 2000 calculator, 1/13 = 0.076923076923076923076923076923077.

Hence,

P(A) = (0.15) * (0.85) * (5 * 0.076923076923076923076923076923077)
P(A) = 0.049038461538461538461538461538413

P(B) = (0.15) * (0.85) * (3 * 0.076923076923076923076923076923077)
P(B) = 0.029423076923076923076923076923048

P(C) = (0.15) * (0.85) * (2 * 0.076923076923076923076923076923077)
P(C) = 0.019615384615384615384615384615365

P(D)..P(M) all equal 0, in case anyone is interested, so the sum of the probabilities never achieves 1 in this distribution. In fact, according to this simplified model there is only a 1 in 10 chance that a random surfer will click on any link in the collection and end up somewhere.

After adding two outbound links to four of page A's supporting pages, the PageRanks look like this:

P(B) = (0.15) * (0.85) * ((4 * (0.076923076923076923076923076923077 / 3)) +(3* 0.076923076923076923076923076923077))
P(B) = 0.042499999999999999999999999999957

P(C) = (0.15) * (0.85) * ((4 * (0.076923076923076923076923076923077 / 3)) + (2 * 0.076923076923076923076923076923077))
P(C) = 0.032692307692307692307692307692275

P(A) = (0.15) * (0.85) * (4 * ((0.076923076923076923076923076923077 / 3)) + ( 0.076923076923076923076923076923077))
P(A) = 0.022884615384615384615384615384592

Finally, we add links to page C to all three of page B's exclusive supporters:

P(C) = (0.15) * (0.85) * ((4 * (0.076923076923076923076923076923077 / 3)) + (3 * (0.076923076923076923076923076923077 / 2)) + (2 * 0.076923076923076923076923076923077))
P(C) = 0.047403846153846153846153846153735

P(B) = (0.15) * (0.85) * ((4 * (0.076923076923076923076923076923077 / 3)) +(3* (0.076923076923076923076923076923077 / 2)))
P(B) = 0.02778846153846153846153846153837

P(A) = (0.15) * (0.85) * (4 * ((0.076923076923076923076923076923077 / 3)) + ( 0.076923076923076923076923076923077))
P(A) = 0.022884615384615384615384615384592
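If anyone wants to check or tweak those figures, here is a short Python sketch that simply reproduces the arithmetic exactly as I applied it above (same thirteen documents, same d = 0.85, same three stages). It is only the single-pass sums from this post, not a full iterative PageRank implementation.

d = 0.85
base = 1.0 / 13          # starting value for each of the thirteen documents (A..M)

def pr(contributions):
    # Single-pass value exactly as applied above: (1 - d) * d * sum(PR(T)/C(T))
    return (1 - d) * d * sum(contributions)

# Stage 1: five pages link only to A, three only to B, two only to C
print("A:", pr([base] * 5))                  # each supporting page has one outbound link
print("B:", pr([base] * 3))
print("C:", pr([base] * 2))

# Stage 2: four of A's supporters now also link to B and C (three links each)
print("A:", pr([base / 3] * 4 + [base]))
print("B:", pr([base / 3] * 4 + [base] * 3))
print("C:", pr([base / 3] * 4 + [base] * 2))

# Stage 3: B's three exclusive supporters also link to C (two links each)
print("A:", pr([base / 3] * 4 + [base]))
print("B:", pr([base / 3] * 4 + [base / 2] * 3))
print("C:", pr([base / 3] * 4 + [base / 2] * 3 + [base] * 2))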

#14 whitemark


    Time Traveler Member

  • 1000 Post Club
  • 1071 posts

Posted 26 February 2005 - 03:41 AM

By definition, PageRank only applies to the links. I think what you're referring to as PageRank is their actual search results ranking (or ordering) algorithm.


No, I was talking about PageRank.
Google's Technology Overview page mentions "... PageRank performs an objective measurement of the importance of web pages by solving an equation of more than 500 million variables and 2 billion terms."

What do they mean by '500 million variables' and '2 billion terms'? My first assumption when I read '2 billion terms' was that it perhaps stood for the 2 billion pages in their index at that time...

Any ideas on what these mean? (Oh, by the way, me not into maths ... :)

#15 Black_Knight


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9339 posts

Posted 27 February 2005 - 12:24 AM

In the very earliest stages, Google (or more correctly, its forerunner, BackRub) passed 100% of PageRank along through the links. This is mentioned in the feedback-loop part of the original papers as the reason the damping factor was created. The fifteen percent figure quoted is the exact counterbalance to the 15% intrinsic value they decided to give each page. That's why, in the sums, the intrinsic 15% value of a page added to the 85% of value passed through links still makes 100%, in a way.

As we all can tell, 100% = 1 whole unit of whatever it is you were calculating a percentage of. There again is that important number 1.

However, this isn't done so that pages with no links to them can still have a tiny value (15%). As anyone who's read the papers in detail will know, orphan pages (pages that no links point to) and dead-end pages (pages with no outbound links on them) are both removed from the PageRank calculations entirely before the iterative process begins.
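As a rough sketch of what that pre-processing might look like (my own illustration of the idea, not code from the papers), you would strip those pages from the link graph before iterating:

def prune_for_pagerank(links):
    # links maps each page to the set of pages it links to.
    # Repeatedly drop orphan pages (nothing links to them) and dead-end pages
    # (they link to nothing still in the graph); removals can create new
    # orphans and dead ends, hence the loop.
    pages = set(links)
    while True:
        inbound = {p: set() for p in pages}
        for src in pages:
            for dst in links.get(src, set()):
                if dst in pages:
                    inbound[dst].add(src)
        drop = {p for p in pages
                if not inbound[p] or not (links.get(p, set()) & pages)}
        if not drop:
            return pages
        pages -= drop

The iterative PageRank calculation would then run only over the pages this returns.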

I can see that I'm going to need to quote references to show Michael just where his mistake lies, so I'll have to return to this topic later to do exactly that.

#16 Michael_Martinez


    Time Traveler Member

  • 1000 Post Club
  • 1354 posts

Posted 28 February 2005 - 05:35 PM

No, I was talking about PageRank.

Google's Technology Overview page mentions "... PageRank performs an objective measurement of the importance of web pages by solving an equation of more than 500 million variables and 2 billion terms."


It's mostly marketing gobbledy-gook, to be honest. There is insufficient information on that page for anyone really to determine anything about what is actually going on under the hood.

Some search industry insiders are now openly claiming that Google has never implemented the PageRank algorithm.

I'm not so sure that dog will hunt, though. Google proved to be too easy to manipulate through link farming, link bombing, and some other less frequently used link-based methodologies for me to believe they never implemented SOMETHING which counts links.


