Cre8asite Tech News Reporter Group: 1000 Post Club
Joined: 18-June 04
Posts: 1,541
From: Tatooine
|
Dec 6 2006, 03:45 AM |
|
|
Might start a good discussion... What are your thoughts on Andy Steggles' comments?
Dissecting the Value of PageRank |
||
Hall of Famer Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
|
Dec 6 2006, 05:56 AM |
|
|
Actually, at least on Google, the duplicate content filters work quite well.
I assume the issue with duplicate content with regard to pagerank is fairly simple: there are only a finite number of links available which pass value. If those links are split among more URLs (within your site or outside, it doesn't really matter) then each link will carry less value, hence each URL will have less pagerank. Google will filter based on duplicate content (cross domain as well, though it's not as simple) and often uses pagerank as a measure to find the "best" version. (Bill will probably write a 10 page response with many great links and I bet I'm getting most of it wrong.)

Imagine the following versions: There are 100 LV ("linkvalue", my made-up link currency) available for your content. You could:
A. put it on a single URL, then that URL will inherit the full LV
B. put it on 100 URLs, then each URL will inherit 1/100th of the full LV (best case)

Now, assume you want your content to rank. In order to get into the top 10, you might need at least 30 LV (some competition, the #10 item has 29 LV). In the first case, you'll easily make it with 100 LV; in the second case your URL only has 1 LV and you won't make the top 10. Of course it's not always that simple in real life, but I hope it shows a bit how duplicate content can influence ranking and pagerank.

Duplicate content can take a lot of different forms and yes, they can analyze the "whole web" for it in a relatively short time; they have some neat algorithms for that. I took one idea and turned it into a simple online tool (see thread and example). The algorithms described in the patents are generally a level higher and more optimized for scalability. I bet Google has an internal tool that shows the "uniqueness" of a domain with a simple scale (or color codes). I wonder how long it will be until we see that in the webmaster console.

John |
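The LV arithmetic above can be written out as a tiny sketch. "LV" is the made-up currency from the post, and the function names here are hypothetical, chosen only for illustration. It assumes the best case from the post, where the value splits evenly across URLs.

```python
def link_value_per_url(total_lv, url_count):
    """Best case from the post: the available LV is split evenly across URLs."""
    return total_lv / url_count


def reaches_top10(url_lv, tenth_place_lv):
    """A URL gets into the top 10 only if it beats the current #10 item."""
    return url_lv > tenth_place_lv


TOTAL_LV = 100        # LV available for the content (figure from the post)
TENTH_PLACE_LV = 29   # LV of the current #10 item (figure from the post)

single = link_value_per_url(TOTAL_LV, 1)     # case A: one URL
split = link_value_per_url(TOTAL_LV, 100)    # case B: 100 URLs

print(single, reaches_top10(single, TENTH_PLACE_LV))  # 100.0 True
print(split, reaches_top10(split, TENTH_PLACE_LV))    # 1.0 False
```

As the post says, real ranking is not this simple; the sketch only shows how splitting a fixed amount of value across duplicates can push every copy below the cutoff.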
||
Hall of Famer Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
|
Dec 6 2006, 11:39 AM |
|
|
Hi Travis
I don't think it's possible to "know" what internal tools they use at Google; however, their patents do show that it certainly is possible, and even when playing with it on a small scale it is possible to see how it could be done without having to throw too many resources at it.

Just an example, the shingle algorithm (I'll just stick to that since I have played with it):
- Google has already crawled all the pages it needs; it has the contents in its own cache (I think we can assume that this is true, though it's never finished).
- Taking those pages, extracting "shingles" of say 15 consecutive words and keeping them as a "hash" only is fairly trivial (let me know if you need more on that part).
- Assuming the average page has 2kb of text on it, that might be 350 words or about the same number of shingles. That could mean 1.4kb of hash data per page, which is still much less than the cached page would take. In the patents they mention further ways to reduce that by a large factor, if I remember correctly.
- Once they have that for all their stock of pages, it's simple to test for uniqueness: take the hash values for the shingles on the page, and see how many of them are already known. It's quick and easy, with no need to compare 1 page versus all the others: just check how many of the shingles are known; the others are unique to the page.

I'll have to check Mr Duplicate Content's website for some links later tonight.

John |
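The shingle steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's method or the code behind the online tool mentioned earlier. The names `shingle_hashes` and `ShingleIndex` are hypothetical. Since 1.4kb for about 350 shingles works out to 4 bytes per hash, the sketch uses the 32-bit `zlib.crc32` checksum as a stand-in hash; a real system would pick its hash function and its reduction tricks far more carefully.

```python
import zlib


def shingle_hashes(text, k=15):
    """Return the set of 32-bit hashes of every run of k consecutive words."""
    words = text.lower().split()
    if len(words) < k:
        return set()  # page too short to produce a single shingle
    return {
        zlib.crc32(" ".join(words[i:i + k]).encode("utf-8"))
        for i in range(len(words) - k + 1)
    }


class ShingleIndex:
    """Keeps only the shingle hashes of known pages, never the page text."""

    def __init__(self, k=15):
        self.k = k
        self.known = set()

    def add_page(self, text):
        self.known |= shingle_hashes(text, self.k)

    def uniqueness(self, text):
        """Fraction of the page's shingles not seen before (None if too short)."""
        hashes = shingle_hashes(text, self.k)
        if not hashes:
            return None
        return len(hashes - self.known) / len(hashes)


# Small shingles (k=3) so the toy sentences produce several of them.
index = ShingleIndex(k=3)
index.add_page("the quick brown fox jumps over the lazy dog")

copy_score = index.uniqueness("the quick brown fox jumps over the lazy dog")
fresh_score = index.uniqueness("completely different words appear on this other page")
print(copy_score, fresh_score)
```

The lookup is a set membership test per shingle, which is the point made in the last bullet: a new page is checked against the pool of known hashes, not against every other page.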
||
Star Member Group: 1000 Post Club
Joined: 9-January 05
Posts: 1,532
From: Perth, Western Australia
|
Dec 6 2006, 07:16 PM |
|
|
Thanks John,
10 points for putting up some sort of explanation. On nearly every occasion when you ask an SEO for some factual evidence, you get this sound

The main issue I have is the lack of application or the severity of the penalty. You only need fear duplicate content if it resides on the same site. We have seen two websites in recent times with tens of thousands of pages where this could be applied, but does not appear to be.

There are too many instances where a website's content will just be re-written exactly the same as on another website, and quite legitimately in some circumstances. Comparing duplicate content between link neighbours would probably be the first logical step. How many times do you see someone make a mobile phone ringtone website, and then make another one right next to it? |
||
Hall of Famer Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
|
Dec 6 2006, 07:57 PM |
|
|
Hi Travis
Google doesn't apply any special penalty on in-site duplicate content; they just filter it out of the search results. Forums are notorious in that regard: they'll get the same content indexed 100's of times. Vanessa Fox just covered in-site duplicate content in an interview yesterday; it might be interesting (though it just more or less says the same thing).

Out-of-site duplicate content is probably something completely different and could be a basis for a "penalty" or at least a dose of "reduced crawler priority". Google is fairly good at finding the original version of content (provided it's not an article that was distributed and published by 100's of sites at the same time). I have seen a few sites which had problems getting indexed, apparently because of that (the site gets crawled but not indexed: Google wants to read the content but then decides not to publish it in the index).

Google's Adam Lasnik (who btw just got a discount-code named after him):

QUOTE
It's not clear to me whether you have original content or not on this site. At least from my brief look, it seems that most snippets I've checked out are from other sites. Are you aggregating content with permission, or...?

I think it makes a lot of sense to differentiate between in-site and cross-site duplicate content. In-site duplicates are a fact of life and almost any webserver will have them (eg http://domain.com/?hahaha vs http://domain.com/ ). It certainly makes sense to reduce them to a minimum, to try to concentrate the pagerank (and optimize the search engine listings), but they generally won't kill your site.

Cross-site duplicate content is something else... if Google notices that a whole domain already exists online, it would make sense to ignore the new copy and just keep the original indexed, right? (Keep in mind that Google's founders also worked on content ownership recognition before working on the search engine.)

John |
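For the in-site case ( http://domain.com/?hahaha vs http://domain.com/ ), one way to spot such duplicates in your own URL list is to reduce each URL to a canonical form and see which ones collapse together. The sketch below uses Python's standard `urllib.parse`; the function name `canonical_url` and the `keep_params` argument are hypothetical. It assumes that any query parameter not explicitly listed has no effect on the page content, which is true for the example above but has to be decided per site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url, keep_params=()):
    """Drop query parameters that are assumed not to change the content."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in keep_params
    ]
    path = parts.path or "/"  # treat an empty path as the site root
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(kept), "")
    )


print(canonical_url("http://domain.com/?hahaha"))  # http://domain.com/
print(canonical_url("http://domain.com/"))         # http://domain.com/
print(canonical_url("http://domain.com/page?id=7&sid=abc", keep_params=("id",)))
# http://domain.com/page?id=7
```

URLs that map to the same canonical form are candidates for consolidation, for example by linking internally only to the canonical version.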
||