Dissecting The Value Of Pagerank
Posted 06 December 2006 - 05:21 AM
I agree (other than the "indicator of quality" part).
Above all, however, don't invest the majority of your time obsessing over PageRank as the end-all, be-all of tools for achieving high search rankings. PR is only an indicator of quality, and therefore is just one cog in the wheel that drives the machine of search engine optimization.
... but it's still fun to play with. I'm currently putting together a simple pagerank "toy" to let the user set up a site or a series of sites and test how pagerank is distributed. It's really interesting to see that even with an existing structure and existing inbound links you can adjust the value distribution and the total pagerank significantly.
Can a change in your internal linking make your site jump 100 places? Probably not, but like many items, if you keep it in mind when designing a new site, you can get a slight edge out of it.
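A "toy" like that is easy to sketch. Here is a minimal power-iteration PageRank in Python; the site structure and the 0.85 damping factor are illustrative assumptions on my part, not anyone's actual tool:

```python
# Toy PageRank calculator: see how internal linking shifts value around a site.
# The link graph and the damping factor (0.85) are illustrative assumptions.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical site: every page links back home, home links only to "products".
site = {
    "home": ["products"],
    "products": ["home", "widget-a", "widget-b"],
    "widget-a": ["home"],
    "widget-b": ["home"],
}
for page, pr in sorted(pagerank(site).items(), key=lambda kv: -kv[1]):
    print(f"{page}: {pr:.3f}")
```

Rewire a few of the internal links in `site` and rerun it, and you can watch value shift between pages, which is exactly the effect described above.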
Posted 06 December 2006 - 05:28 AM
"One of the major PageRank killers is duplicate content. If your site's content is mirrored in a good number of other online sources, your PR will drop...."
It should read
"One of the major killers of search engine performance is putting a website together on holiday homes with 150,000 pages of templated content with pages which are essentially identical."
I don't know whether anyone looked at the Website Hospital recently, but a guy posted a 10,000-page website with only about 40 genuine PDA products.
Duplicate content on that website was not punished on Google as far as I could see. But it should be.
Cross correlating pages on the same site to determine similarity is not hard.
Cross correlating pages on different sites to determine similarity is almost a technical impossibility.
The mathematical calculation would be absurdly large. If we took everybody's pages and cross-correlated each page with every other page on the planet, you would be there until 2067. Then you have to move forward to Page 2 and compare that to every other page on the planet, then Page 3, and so on all the way to Page 999,999,999,999,999,999,999
And in the results, if you did find similar pages, who suffers? Who gets the penalty?
It does not make sense. A lot of content on the web needs to be duplicated:
> stock market announcements,
> political statements,
> policies and speeches,
> product descriptions
> rules and regulations of law
> privacy policies (every website developer on the planet just steals someone else's)
so I don't really buy it.
Duplicate content filters may be good for determining similarity on the same site, but they either have not been fully implemented, or the penalties are too light to notice.
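For what it's worth, pairwise similarity within one site really is not hard. A rough sketch using word n-gram overlap (Jaccard similarity; the shingle size of 5 is an arbitrary choice for this example):

```python
def shingles(text, size=5):
    """All runs of `size` consecutive words in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

def similarity(a, b):
    """Jaccard similarity between two pages: 0.0 (unrelated) to 1.0 (identical)."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Two templated holiday-home pages differing only in the details:
page1 = "charming holiday home in devon sleeps six three bedrooms close to the beach book online today"
page2 = "charming holiday home in devon sleeps eight four bedrooms close to the beach book online today"
print(round(similarity(page1, page2), 2))
```

Run over all page pairs on one site this is quadratic, but for a single site that is perfectly tractable; it is doing it across the whole web pairwise that blows up, which is the point above.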
And why does everyone in the UK own a holiday home rental website ?
Edited by travis, 06 December 2006 - 05:30 AM.
Posted 06 December 2006 - 05:56 AM
I assume the issue with duplicate content with regards to pagerank is fairly simple: there are only a finite number of links available which pass value. If those links are split among more URLs (within your site or outside, it doesn't really matter) then each link will carry less value, hence each URL will have less pagerank. Google will filter based on duplicate content (cross-domain as well, though it's not as simple) and often uses pagerank as a measure to find the "best" version. (Bill will probably write a 10-page response with many great links, and I bet I'm getting most of it wrong.)
Imagine the following versions:
There is 100 LV ("linkvalue", my made up link currency) available for your content. You could:
A. put it on a single URL, then that URL will inherit the full LV
B. put it on 100 URLs, then each URL will inherit 1/100th of the full LV (best case)
Now, assume you want your content to rank. In order to get into the top 10, you might need at least 30 LV (some competition; the #10 item has 29 LV). In the first case, you'll easily make it with 100 LV; in the second case your URL only has 1 LV and you won't make the top 10. Of course it's not always that simple in real life, but I hope it shows a bit how duplicate content can influence ranking and pagerank.
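The arithmetic above in a few lines (LV and the 30-LV threshold are of course this post's made-up currency, not anything Google publishes):

```python
TOTAL_LV = 100   # made-up link value available for the content
THRESHOLD = 30   # LV needed to beat the hypothetical #10 result (29 LV)

for copies in (1, 2, 5, 100):
    lv_per_url = TOTAL_LV / copies
    verdict = "ranks" if lv_per_url >= THRESHOLD else "misses the top 10"
    print(f"{copies:3d} copies -> {lv_per_url:6.1f} LV each: {verdict}")
```

Even splitting the content across just five URLs drops each copy below the threshold in this toy scenario.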
Duplicate content can take a lot of different forms and yes, they can analyze the "whole web" for it in a relatively short time; they have some neat algorithms for that. I took one idea and turned it into a simple online tool (see thread and example). The algorithms described in the patents are generally a level higher and more optimized for scalability. I bet Google has an internal tool that shows the "uniqueness" of a domain on a simple scale (or with color codes). I wonder how long it will be until we see that in the webmaster console.
Posted 06 December 2006 - 07:57 AM
Actually, at least on Google, the duplicate content filters work quite well.
I am not going to disagree, but I need to see some proof.
Where is the evidence for these comments ?
I bet Google has an internal tool that shows the "uniqueness" of a domain on a simple scale (or with color codes)
How is a webmaster tool going to compare every website on the planet ?
We need to base our discussion in facts and data, rather than dreamy statements about what could possibly be.
Posted 06 December 2006 - 11:39 AM
I don't think it's possible to "know" what internal tools they use at Google, however their patents do show that it certainly is possible and even when playing with it on a small scale it is possible to see how it could be done without having to throw too many resources at it.
Just an example, the shingle algorithm (I'll just stick to that since I have played with it):
- Google has already crawled all the pages it needs; it has the contents in its own cache (I think we can assume that this is true, though it's never finished)
- Taking those pages and extracting "shingles" of, say, 15 consecutive words and keeping them as a "hash" only is fairly trivial. (let me know if you need more on that part)
- Assuming the average page has 2kb of text on it, that might be 350 words, or about the same number of "shingles"; that could mean 1.4kb of hash-data per page, which is still much less than the cached page would take. In the patents they mention further ways to reduce that by a large factor, if I remember correctly.
- Once they have that for all their stock of pages, it's simple to test for uniqueness: take the hash-values for the shingles on the page and see how many of them are already known. It's quick and easy, no need to compare one page versus all the others -- just check how many of the shingles are already known; the rest are unique to the page.
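A sketch of that last step (the 4-byte truncated hashes match the 1.4kb-per-350-shingles arithmetic above; the plain Python set stands in for whatever index Google would actually use):

```python
import hashlib

SHINGLE_SIZE = 15  # consecutive words per shingle, as described above

def shingle_hashes(text, size=SHINGLE_SIZE):
    """Hash every run of `size` consecutive words down to 4 bytes each."""
    words = text.lower().split()
    return {
        hashlib.md5(" ".join(words[i:i + size]).encode()).digest()[:4]
        for i in range(len(words) - size + 1)
    }

def uniqueness(text, index):
    """Fraction of a page's shingles not yet in the global index; adds them."""
    hashes = shingle_hashes(text)
    if not hashes:
        return 1.0
    new = hashes - index
    index.update(hashes)
    return len(new) / len(hashes)

index = set()  # stands in for the global shingle index
page = ("the quick brown fox jumps over the lazy dog while the cat "
        "sleeps on the warm mat near the old stone fireplace")
print(uniqueness(page, index))  # first sighting: fully unique
print(uniqueness(page, index))  # exact duplicate: nothing new
```

Note that checking a new page is a set lookup per shingle, not a comparison against every other page, which is exactly why this scales where pairwise cross-correlation does not.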
I'll have to check Mr Duplicate Content's website for some links later tonight ....
Posted 06 December 2006 - 07:16 PM
10 points for putting up some sort of explanation.
On nearly every occasion when you ask an SEO for some factual evidence, you get silence.
The main issue I have is the lack of application or the severity of the penalty.
You only need to fear duplicate content if it resides on the same site. We have seen two websites in recent times with tens of thousands of pages where this could be applied, but it does not appear to be.
There are too many instances where the same content will simply appear word-for-word on another website, quite legitimately in some circumstances.
Comparing duplicate content between link neighbours would probably be the first logical step.
How many times do you see someone make a mobile phone ringtone website, and then make another one right next to it?
Posted 06 December 2006 - 07:57 PM
Google doesn't apply any special penalty on in-site duplicate content, they just filter it out of the search results. Forums are notorious in that regard: they'll get the same content indexed 100's of times.
Vanessa Fox just covered in-site duplicate content in an interview yesterday; it might be interesting (though it more or less says the same thing): http://videos.webpro...oogle-sitemaps/
Out-of-site duplicate content is probably something completely different and could be the basis for a "penalty", or at least a dose of "reduced crawler priority". Google is fairly good at finding the original version of content (provided it's not an article that was distributed and published by hundreds of sites at the same time). I have seen a few sites which had problems getting indexed apparently because of that (the site gets crawled but not indexed: Google wants to read the content but then decides not to publish it in the index).
Google's Adam Lasnik (who btw just got a discount-code named after him) mentions it from time to time with regards to certain sites trying to get indexed:
It's not clear to me whether you have original content or not on this
site. At least from my brief look, it seems that most snippets I've
checked out are from other sites. Are you aggregating content with
I think it makes a lot of sense to differentiate between in-site and cross-site duplicate content. In-site duplicates are a fact of life and almost any webserver will have it (eg http://domain.com/?hahaha vs http://domain.com/ ). It certainly makes sense to get it reduced to a minimum - to try to concentrate the pagerank (and optimize the search engine listings), but it generally won't kill your site.
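One practical way to keep those accidental in-site duplicates down is to normalize URLs before linking or redirecting to them. A rough sketch (which query parameters count as noise is my own assumption):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

NOISE_PARAMS = {"sessionid", "ref", "sid"}  # illustrative list only

def normalize(url):
    """Lowercase scheme/host, force a path, drop noise params and valueless
    query strings (like ?hahaha), and strip fragments."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in NOISE_PARAMS]
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", urlencode(query), ""))

print(normalize("http://domain.com/?hahaha"))          # -> http://domain.com/
print(normalize("HTTP://Domain.com/page?ref=x&id=5"))  # -> http://domain.com/page?id=5
```

Redirecting every variant to its normalized form is one way to concentrate the pagerank onto a single URL, as discussed above.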
Cross-site duplicate content is something else... if Google notices that a whole domain already exists online, it would make sense to ignore the new copy and just keep the original indexed, right? (keep in mind that Google's founders also worked on content ownership recognition before working on the search engine ).