Moderator Alumni | Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
Nov 27 2005, 01:40 AM
I'll definitely address those, Bob.
Before I dig too deeply into them, let's see if anyone else has more to add, or has issues with any of the ones that I've listed. I'll start with the one that you've indicated "takes the cake."

QUOTE
9. There is a duplicate content penalty
LOL. First I thought you might have forgotten to insert the "not" but after reading the "myths" I think you really don't believe in dup penalties. Now that one takes the cake.

We know from a number of Altavista patents that there are ways of identifying mirrored sites, and sites that are very similar to other sites on a number of levels, and that search engines can decide not to include those pages in their index. I'm not sure that is what we are talking about when we talk about a "duplicate penalty." If that is what you consider a duplicate penalty, I'll concede that aspect of it.

Method and apparatus for finding mirrored hosts by analyzing urls
Method and apparatus for finding mirrored hosts by analyzing connectivity and IP addresses

We also know from Altavista and Google patents that there are methods of identifying pages that are very similar.

Method for determining the resemblance of documents

The abstract from this Altavista patent:

QUOTE
A method for facilitating the comparison of two computerized documents. The method includes loading a first document into a random access memory (RAM), loading a second document into the RAM, reducing the first document into a first sequence of tokens, reducing the second document into a second sequence of tokens, converting the first set of tokens to a first (multi)set of shingles, converting the second set of tokens to a second (multi)set of shingles, determining a first sketch of the first (multi)set of shingles, determining a second sketch of the second (multi)set of shingles, and comparing the first sketch and the second sketch. The sketches have a fixed size, independent of the size of the documents.
The resemblance of two documents is provided using a sketch of each document. The sketches may be computed fairly fast and given two sketches the resemblance of the corresponding documents can be computed in linear time in the size of the sketches.

Method for identifying near duplicate pages in a hyperlinked database

The abstract from this Altavista patent:

QUOTE
A method is described for identifying pages that are near duplicates in a linked database. In the linked database, pages can have incoming links and outgoing links. Two pages are selected, a first page and a second page. For each selected page, the number of outgoing links is determined. The two pages are marked as near duplicates based on the number of common outgoing links for the two pages

Method for indexing duplicate records of information of a database

The abstract for this Altavista patent:

QUOTE
A computer implemented method indexes duplicate information stored in records having different unique addresses in a database. A fingerprint is generated for each record, the fingerprint is a singular value derived from all of the information of the record. The fingerprint is stored in the index as a unique fingerprint if the fingerprint is different than a previously stored fingerprint of the index. A reference to the unique address of the record is stored with the fingerprint. If the fingerprint is identical to the previously stored fingerprint, then store the reference to the address of the record with the previously stored fingerprint.

This last one includes the clearest statement that a page would be deleted from the index if it was a duplicate of a "master page." The difficulty there may not be determining duplicates as much as it is identifying which document is the "master page" and which is the duplicate.
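For readers who want to see the shingle-and-sketch idea from the resemblance patent in concrete form, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the patent's actual procedure: the word-level shingles, the SHA-1 based hash, the "keep the smallest hashes" sketch, and all function names (`shingles`, `sketch`, `resemblance`) are choices made for this example only.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    tokens = text.lower().split()
    if len(tokens) < w:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def _stable_hash(shingle):
    # Python's built-in hash() is salted per process, so use a stable digest.
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(text, w=4, size=64):
    """Fixed-size sketch: the `size` smallest shingle hashes, in sorted order."""
    return sorted({_stable_hash(s) for s in shingles(text, w)})[:size]


def resemblance(sketch_a, sketch_b, size=64):
    """Estimate shingle-set overlap (Jaccard) from two sketches alone."""
    a, b = set(sketch_a), set(sketch_b)
    smallest_of_union = sorted(a | b)[:size]
    if not smallest_of_union:
        return 0.0
    shared = sum(1 for h in smallest_of_union if h in a and h in b)
    return shared / len(smallest_of_union)


# Hypothetical example documents.
page = "the quick brown fox jumps over the lazy dog"
copy_with_extra_word = page + " today"
print(round(resemblance(sketch(page), sketch(copy_with_extra_word)), 3))
# 0.857 (6 of the 7 distinct shingles are shared)
```

The point of the design is the one the abstract makes: each sketch has a fixed maximum size no matter how long the page is, so two pages can be compared without holding either full document. For short pages like the ones above the sketch holds every shingle, so the estimate is exact; for long pages it becomes an approximation.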
We don't know how closely any of the search engines are presently using the ideas in those patents from Altavista, but some of the folks who worked upon them have worked on similar patents for other search engines. For instance, Monika Henzinger has worked on one of a couple of patents from Google that look mostly at duplicate content issues.

Detecting duplicate and near-duplicate files

The abstract from this Google patent, filed January 24, 2001 and granted December 2, 2003:

QUOTE
Improved duplicate and near-duplicate detection techniques may assign a number of fingerprints to a given document by (i) extracting parts from the document, (ii) assigning the extracted parts to one or more of a predetermined number of lists, and (iii) generating a fingerprint from each of the populated lists. Two documents may be considered to be near-duplicates if any one of their fingerprints match.

This document does a nice job of raising issues that weren't considered in the Altavista patents, while building upon ideas mentioned in a number of them. While it is possible that some duplicates may be removed from the database, this statement leads me to believe that what is more likely to happen is a filtering of results:

QUOTE
The present invention may function to generate clusters of near-duplicate documents, in which a transitive property is assumed. Each document may have an identifier for identifying a cluster with which it is associated. In this alternative, in response to a search query, if two candidate result documents belong to the same cluster and if the two candidate result documents match the query equally well, only the one deemed more likely to be relevant (e.g., by virtue of a high Page rank, being more recent, etc.) is returned.

It does note that there are some benefits to not including duplicate pages.
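The filtering described in that second quote can be sketched in a few lines of Python. This is only an illustration of the quoted rule, not Google's implementation: the record fields (`cluster`, `score`, `quality`), the function name, and the example URLs are all hypothetical, and `quality` stands in for whatever signal (PageRank, recency, and so on) decides which near-duplicate is "more likely to be relevant."

```python
def filter_near_duplicates(candidates):
    """Drop near-duplicates at query time, keeping one page per cluster and score.

    candidates: ranked list of dicts with hypothetical keys
      url     - page address
      cluster - near-duplicate cluster id, or None if the page has no cluster
      score   - how well the page matches the query
      quality - tiebreaker signal; higher wins
    """
    best = {}
    for doc in candidates:
        if doc["cluster"] is None:
            continue
        # Only pages in the same cluster that match the query equally well compete.
        key = (doc["cluster"], doc["score"])
        if key not in best or doc["quality"] > best[key]["quality"]:
            best[key] = doc

    # Preserve the original ranking order; nothing is deleted from the index.
    return [
        doc for doc in candidates
        if doc["cluster"] is None or best[(doc["cluster"], doc["score"])] is doc
    ]


results = filter_near_duplicates([
    {"url": "a.example/page", "cluster": 1, "score": 0.9, "quality": 5},
    {"url": "mirror.example/page", "cluster": 1, "score": 0.9, "quality": 2},
    {"url": "b.example/other", "cluster": None, "score": 0.8, "quality": 1},
    {"url": "c.example/page", "cluster": 1, "score": 0.7, "quality": 9},
])
print([doc["url"] for doc in results])
# ['a.example/page', 'b.example/other', 'c.example/page']
```

Note that the mirror page is hidden for this query only. It still exists in the candidate data, which is the difference between filtering results and removing or penalizing a page.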
But we do see, when we search for a unique string of text from a page, that Google will offer "very similar" pages in its search results by including a link to click upon to see them. So Google is filtering results for duplicates.

Does that mean that they don't delete sites, or penalize them? How do they know which page is the original? Which page is the one that shows up when one page is filtered, and another isn't? That's probably a more important question to ask than whether some pages are penalized for duplicate content. How does a search engine know which is the original, and which is the duplicate?

William Pugh, one of the co-inventors on that document, has a pdf presentation that I've seen appear and disappear and reappear and disappear again from the web at: http://www.cs.umd.edu/~pugh/google/Duplicates.pdf . (If you do a search for it, you can see the "HTML version" which is cached in Google.) Here are a couple of lines from that document:

QUOTE
False Positive Rate
• 0.1% seems like a pretty low false positive rate
• unless you are indexing billions of web pages
• Need to be very careful about deciding to discard web pages from index
• Less careful about eliminating near duplicates from query results

He does note that he isn't sure if Google has adopted the method described in this patent.

A patent that was filed (October 6, 2000) and granted (September 2, 2003) earlier than the one above from Google is:

Detecting query-specific duplicate documents

The abstract from this Google patent:

QUOTE
An improved duplicate detection technique that uses query-relevant information to limit the portion(s) of documents to be compared for similarity is described. Before comparing two documents for similarity, the content of these documents may be condensed based on the query. In one embodiment, query-relevant information or text (also referred to as "snippets") is extracted from the documents and only the extracted snippets, rather than the entire documents, are compared for purposes of determining similarity.

Here we have an approach that decides whether documents are duplicates based upon the snippets being returned for a query. Under this method, pages are filtered when results are served rather than carrying penalties while sitting in an index. Here's a quote from the document that describes some of the issues addressed by this patent:

QUOTE
Some duplicate avoidance techniques are effected during the automated indexing operation. Similar documents can be flagged by (i) defining a similarity measure between two documents, and (ii) defining the two documents as "duplicates" if the similarity measure exceeds a predetermined threshold.

Unfortunately, however, often duplicate information may be found in documents that are not exactly the same or even very similar. For example: (i) identical content may be presented with different formatting
(e.g., plain text versus HTML); (ii) different headers and/or footers may be prepended and/or appended, respectively, to identical content; (iii) hit counters may be appended to identical content; (iv) last modified dates may be appended, to identical content; and (v) one web site may include a copy of content found elsewhere (e.g., as a part of a compilation or aggregation of content, or simply as an insertion). Cases (ii)-(iv) are illustrated by the Venn diagrams of FIGS. 1 and 2. FIG. 1 illustrates the case where a second document merely adds a small amount of information (e.g., a counter, a footer, etc.) to a first document, whereas FIG. 2 illustrates the case where a second document slightly changes some information (e.g., a last modified date) of a first document. The present invention may be used to detect such "duplicates" with slight changes.

Furthermore, the present invention may be used to detect duplicate content within documents that have a lot of different information, such as documents with different formatting codes or documents that aggregate or incorporate other content. Many prior techniques are not well-suited for such cases. For example, assume that documents A and B each contain basic financial information about companies. Assume further that document A has information on 50 companies, while document B has information on 100 companies, at least some of which are the same as those in document A. (For example, document B could be a later, expanded version of document A.) The Venn diagrams of FIGS. 3 and 4 illustrate such examples.

When I see people writing about a "duplicate content penalty," it is often in the context of "how much duplicate content can my page have before it is considered a duplicate of another page, and given a penalty by the search engines?" If we can rely upon some of the documentation above, the answer is that some content may not be indexed at all, especially if it is a site that is mirrored.
Duplicate pages may or may not be indexed, but are possibly more likely to be filtered out when search results are returned. And that is why I say that a duplicate content penalty is a myth. Of course, if you can provide some information otherwise, I would be appreciative.
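To make the query-specific approach quoted above a little more concrete, here is a rough Python sketch of comparing only the query-relevant snippets of two pages rather than the whole pages. Everything specific in it is an assumption for illustration: the window of words around each query term, the word-overlap (Jaccard) similarity, the 0.8 threshold, the function names, and the example text are not taken from the patent.

```python
def snippet_words(text, query, window=3):
    """Return the words within `window` positions of any query term."""
    tokens = text.lower().split()
    terms = set(query.lower().split())
    keep = set()
    for i, token in enumerate(tokens):
        if token in terms:
            keep.update(range(max(0, i - window), min(len(tokens), i + window + 1)))
    return [tokens[i] for i in sorted(keep)]


def overlap(words_a, words_b):
    """Jaccard overlap between two collections of words."""
    a, b = set(words_a), set(words_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def duplicates_for_query(doc_a, doc_b, query, threshold=0.8, window=3):
    """Treat two pages as duplicates for this query if their snippets overlap enough."""
    return overlap(snippet_words(doc_a, query, window),
                   snippet_words(doc_b, query, window)) >= threshold


# Hypothetical pages: identical content, one with a hit counter, one with a date.
core = "acme widgets are rated best for home repair"
with_counter = core + " visitors so far 1042"
with_date = core + " last modified nov 27 2005"

print(duplicates_for_query(with_counter, with_date, "widgets"))           # True
print(overlap(with_counter.split(), with_date.split()) >= 0.8)             # False
```

The two pages look different when compared in full, because the counter and the date differ, but for the query "widgets" the surrounding text is identical, so only one of them would need to be shown. Nothing in this check requires either page to carry a penalty in the index.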
Moderator Alumni | Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
Nov 27 2005, 02:22 AM
QUOTE
7. The more links the better
Confirm. Long as they are not ROS, not added more than 10-20% of your total a month, in or from bad nebs, I am for the more links the better. I don't think that's a myth.

I often see this one spouted in ads for "SEO companies" (I'll use that term loosely) that will hook people up with thousands of links for $129/month or some similar price. Saw some advertising their SEO linking services on Amazon.com recently.

Quality over quantity is the way to go here. Focus upon getting links that will bring you traffic, links from sites that may be considered authorities, links from popular sites and popular pages, and links that may lead to conversions and to fulfilling the objectives of a site.

Of course, I'd rather have a link from the front page of each of these sites than a few thousand from other pages:

Adobe.com
Apple.com
Energy.gov
FirstGov.gov
Google.com
Harvard.edu
Macromedia.com
NASA.gov
NSF.gov
NYTimes.com
Real.com
StatCounter.com
W3.org
WebStandards.org
Blogger.com

But, ignoring the high page ranks of those sites, I want links from pages that are relevant and that people will follow while looking forward to reading the material on the site the link points to.

I do think that you should work to try to increase the number of links to your site, whether you do it by building content that people want to link to, or by some other means. But a straightforward "More links is better" is just a myth, both from a link popularity stance and from an approach that looks at the opportunity to make a conversion and meet the objectives of the site.

I'll echo the words of Eric Ward here:

QUOTE
Are more links better or not?
No, more links are not better, unless all of them are high-quality links. Numbers aren't as important as context and relevancy.
It is better to have a few links from sites that are similar in content and topic to yours, a few links from the portals, and a few links from site reviewers, than to have 1,000 links on Free For All (FFA) links pages.

Since I dragged a few patents into my last post, I'll bring one into this post, too.

Ranking search results by reranking the results based on local inter-connectivity

Does Google re-sort their returned results based upon links between the pages being returned? This patent from them describes how they would do that. So, it's possible that a handful of links from pages that show up in the results of the same query your site shows up in may provide some value to you.

If all of the links pointing to your site are from pages that have nothing to do with the topic of your site, and aren't relevant, this patent, if in use, might mean that your page could be sorted so that it is behind sites within the results that link to each other.

There may be sites that share a topic with you that aren't your direct competitors and may be willing to point a link to you if you provide content that they find interesting, or that may be of use to their visitors. That's an area where relevance may be more important than numbers.
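As a rough illustration of what reranking on "local inter-connectivity" could look like, here is a deliberately simplified Python sketch. It does not reproduce the patent's actual scoring: the blend of the initial score with a count of links from other pages in the same result set, the `weight` parameter, the function name, and the example data are all assumptions made for this example.

```python
def rerank_by_local_links(results, links, weight=0.5):
    """Re-sort an initial result set using links among the results themselves.

    results: list of (url, initial_score) pairs for one query
    links:   dict mapping a url to the set of urls it links to
    weight:  hypothetical share of the final score given to local links
    """
    in_results = {url for url, _ in results}
    local_votes = {url: 0 for url in in_results}

    # Count only links whose source and target are both in this result set.
    for source in in_results:
        for target in links.get(source, ()):
            if target in in_results and target != source:
                local_votes[target] += 1

    most_votes = max(local_votes.values(), default=0) or 1
    rescored = [
        (url, (1 - weight) * score + weight * local_votes[url] / most_votes)
        for url, score in results
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


initial = [("a.example", 0.9), ("b.example", 0.8), ("c.example", 0.7)]
link_graph = {
    "a.example": {"c.example", "unrelated.example"},
    "b.example": {"c.example"},
}
print([url for url, _ in rerank_by_local_links(initial, link_graph)])
# ['c.example', 'a.example', 'b.example']
```

In this toy example the page with the lowest initial score moves to the top because the other two results for the same query link to it, while links from pages outside the result set count for nothing. That is the sense in which a handful of relevant links might matter more than thousands of unrelated ones.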
Moderator Alumni | Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
Nov 27 2005, 02:54 AM
QUOTE
2. keyword density is important, and you don't want too much or too little.
(you need at least one occurrence of the word or a related word... then if you stuff you will not do well)

I'll give the short answer here, with an example:

This page ranks #1 out of 2,890,000,000 for the phrase "click here" (without quotation marks.)

http://www.adobe.com/products/acrobat/readstep2.html

Neither word appears upon the page. No occurrences at all.

Of course, my answer is a shortcut, but it shows that a document can have a keyword density of 0.00% and still rank number one out of more than 2 billion results.

This article does a nice job of discussing keyword density: The Keyword Density of Non-Sense.

Keyword density is the percentage of the words in a document that are the keyword. That's different from term frequency, which is the number of times the word appears within the document. I believe that's what EGOL is pointing towards, and having the word or words appear upon the page can make it easier for the phrase to rank in search engines, though we still have instances like the Adobe one I pointed towards above.
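The difference between the two measures is easy to see in a small Python sketch. The tokenizer (lowercased runs of letters, digits and apostrophes), the function names, and the sample text are assumptions for illustration; search engines do not publish how, or whether, they compute anything like this.

```python
import re


def words_in(text):
    """Split text into lowercase words (a simple, assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def term_frequency(text, term):
    """Number of times the term appears in the text."""
    return words_in(text).count(term.lower())


def keyword_density(text, term):
    """Percentage of the words in the text that are the term."""
    words = words_in(text)
    if not words:
        return 0.0
    return 100.0 * words.count(term.lower()) / len(words)


sample = "Click here to download. Click now."
print(term_frequency(sample, "click"))             # 2
print(round(keyword_density(sample, "click"), 2))  # 33.33

no_match = "Download the free reader to view files."
print(keyword_density(no_match, "click"))          # 0.0
```

A page like the second sample has a density of 0.0 for "click", which is the situation described for the Adobe page: the measure says nothing about why such a page can still rank.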
Moderator/Blog Editor | Group: Site Admin
Joined: 18-January 05
Posts: 5,375
From: Olympia WA, USA
Nov 27 2005, 06:17 AM
QUOTE(Bill)
16. High rankings are the aim of SEO

Agreed.

17. Optimizing for specific keyword phrases is the goal of SEO

Ranking is only a step. Targeting is only a step. High rankings without something behind them would be like putting up a Very Big Advertisement without further mission.

If optimizing for specific keyword phrases was "the" goal, yoiy, the mind boggles. Is that myth where gibberish generators come from? That which begat Adsense pages with automatically generated nonsense text?

Background reading - thread on Role of an SEM

Elizabeth
Quarter Grand Poster | Group: Members
Joined: 20-March 03
Posts: 289
From: London England
Nov 27 2005, 06:39 AM
QUOTE(Bragadocchio)
You should place content above menus to have it "crawled first"

Is it helpful in the slightest ~B?

I saw the word "should", and I can't disagree about advising someone that they "should" put the content above the menus, but I read here at Cre8 that it is good to get the content as high up the crawl as possible. Is that just a myth please?