> keyword density

Centenarian Poster

Group: Members
Joined: 5-March 06
Posts: 110
post Sep 29 2006, 08:11 AM
I'm fairly convinced now that it's keyword frequency, not density, that "matters" (for lack of a better word).

http://www.google.com/search?hl=en&lr=...amp;btnG=Search

Screenshot here: http://www.seo4fun.com/notes/keyword-frequency-image.html

As you can see, the pages rank in order of the number of times the keyword is repeated (9x ~ 12x). Apart from page size (number of words on the page) and keyword frequency, everything about those pages is identical, including PageRank.

I inflated the word count on the top-ranking page with keyword repeated 12 times to decrease keyword density. Lower density did not get in the way of the page ranking high. Based on this, it seems to me Google is ranking those pages based on frequency, not density.

Moderator Alumni

Group: Hall Of Fame
Joined: 11-February 04
Posts: 5,892
From: Los Angeles, CA
post Sep 29 2006, 10:02 AM
I'm not so sure I'm convinced by your findings - at least not yet.

In your four scenarios, you repeat the keyword 12, 11, 10 & 9 times and ranking follows that pattern.

The density is 3%, 18%, 17% and 16%, respectively. Is it possible that Google simply prefers lower density (rankings 2 through 4 are pretty close in density)?

It seems to me you ought to 'clone' your number 4 ranked page, then add the keyword 4 more times. This would result in a density of 21% and a total of 13 occurrences. If it (new scenario #5) pops to the top, then I'd agree with you that ranking is influenced by keyword occurrences and not density.
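
The arithmetic behind these density figures can be sketched as follows. This is only an illustration: the word counts (446, 66, 57 and 57) are assumptions taken from the MSN listing quoted later in this thread, so the computed percentages may differ by a point or so from the rounded figures above, and the `clone` entry assumes the four extra keywords are simply appended to the number 4 page.

```python
# Keyword density = keyword occurrences / total words on the page.
# Word counts are assumptions taken from figures quoted later in the thread.
pages = {
    "rank 1": (12, 446),
    "rank 2": (11, 66),
    "rank 3": (10, 57),
    "rank 4": (9, 57),
}

def density(occurrences, total_words):
    """Return keyword density as a fraction of all words on the page."""
    return occurrences / total_words

# Hypothetical scenario #5: clone rank 4 and add the keyword 4 more times.
occ4, words4 = pages["rank 4"]
pages["clone"] = (occ4 + 4, words4 + 4)

for name, (occ, words) in pages.items():
    print(f"{name}: {occ}x in {words} words -> {density(occ, words):.1%}")
```

If the clone outranks the 12x page despite having the highest density of the five, that would point to raw occurrences rather than density.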

This post has been edited by Respree: Sep 29 2006, 10:03 AM

Member

Group: Members
Joined: 14-July 06
Posts: 20
post Sep 29 2006, 11:38 AM
I'm going to make use of my student loan payment today and put that library/info science degree to some use and hope that I do not embarrass any of my professors in the process. :)

All search engines work the same. They pattern match a query term against their index of pages and then sort the pulled references according to a relevancy measure. Term frequency is a measure of relevance although it has fallen out of favor as search technology has gotten more sophisticated. If other relevance factors outweigh frequency of use, the page will not display high in the results.
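
A minimal sketch of that match-then-sort idea, assuming term frequency is the only relevancy measure. The page names and texts are made up for illustration; a real engine combines many more signals.

```python
from collections import Counter

# Toy corpus; the page names and texts are made up for illustration.
docs = {
    "page_a": "blue widgets and more blue widgets",
    "page_b": "a page about widgets",
    "page_c": "nothing relevant here",
}

def build_index(docs):
    """Map each term to {doc_id: number of occurrences}."""
    index = {}
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index.setdefault(term, {})[doc_id] = count
    return index

def search(index, term):
    """Return matching doc ids, most occurrences first."""
    postings = index.get(term.lower(), {})
    return sorted(postings, key=lambda d: postings[d], reverse=True)

index = build_index(docs)
print(search(index, "widgets"))
```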

In the glory days before keyword stuffing, query term frequency was a very significant factor in relevance presentation. The glory days are gone and we now reside in a world where it has some limited value. Concepts such as latent semantic indexing, which associates terms with synonyms, and Term Frequency/Inverse Document Frequency [tf-idf] provide a better determination of relevance to the user's query than how often a word is used on the page [IMHO, of course]. I wish that I had sat next to bragadocchio in class as I would likely be able to better explain the concepts.

This post has been edited by marianne: Sep 29 2006, 11:40 AM

Centenarian Poster

Group: Members
Joined: 5-March 06
Posts: 110
post Oct 2 2006, 01:25 PM
QUOTE
It seems to me you ought to 'clone' your number 4 ranked page, then add the keyword 4 more times. This would result in a density of 21% and a total of 13 occurrences. If it (new scenario #5) pops to the top, then I'd agree with you that ranking is influenced by keyword occurrences and not density.


Respree, thanks for the input. I may try that.

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Oct 2 2006, 08:15 PM
QUOTE
I wish that I had sat next to bragadocchio in class as I would likely be able to better explain the concepts.


A library/info science degree probably would have been fun. Been thinking about some computer science classes, but that might be just as interesting.

Frequency rather than density is probably what you are looking for here, with local frequency over global frequency (the use of the word or phrase over the rest of the documents on the web) or, as marianne put it, Term Frequency/Inverse Document Frequency [tf-idf].

Of course, other things likely come into play, such as the effect of anchor text pointing to a page. Other signals are impactful as well, which makes attempting to reverse hack and measure the impact of frequency rates pretty much an impossible proposition.


Centenarian Poster

Group: Members
Joined: 5-March 06
Posts: 110
post Oct 3 2006, 01:36 AM
QUOTE
I'm going to make use of my student loan payment today and put that library/info science degree to some use


Marianne, my comp sci degree has got to be the worst investment in my life :)

QUOTE
Of course, other things likely come into play, such as the effect of anchor text pointing to a page.


Bill, first off, I'm in no way advocating keyword spamming. Compared with off-page optimization, I don't expect repeating keywords on a page 200 times will carry a page trying to rank for "real estate" too far. I think that was the point of Marianne's post, which I agree with: 1) Make a site crawlable 2) Create good content 3) Increase visibility. In short, don't get hung up on old "optimization" techniques.

However, let's not forget Matt Cutts' recent blog article, http://www.mattcutts.com/blog/seo-advice-w...ders-will-love/, in which he says:

QUOTE
Notice what I did with keywords. I carefully chose keywords for the title and the url (note that I used “change” in the url and “changing” in the title). The categories on my post (“How to” and “Linux”) give me a subtle way to mention Linux again, and include a couple extra ways that someone might do a search–lots of user type “how to (do what they want to do).” I thought about the words that a user would type in when looking for an answer to their question, and tried to include those words in the article. I also tried to think of a few word variations and included them where they made sense (file vs. files, bash and bashrc, Firefox and Mozilla, etc.). I’m targeting a long-tail concept where someone will be typing several words, so I’m probably in a space where on-page keywords are enough to rank pretty well.


---

QUOTE
Other signals are impactful as well, which makes attempting to reverse hack and measure the impact of frequency rates pretty much an impossible proposition.


If you're comparing two regular websites, that would be true. But in a controlled environment, where you can rule all other signals out except the one factor you're interested in, those other signals - I would argue - are not impactful.

--

QUOTE
Frequency rather than density is probably what you are looking for here, with local frequency over global frequency (the use of the word or phrase over the rest of the documents on the web) or, as marianne put it, Term Frequency/Inverse Document Frequency [tf-idf].


Bill, I had to read that over about four times before it made sense to me :)

I'd assume Inverse Document Frequency at a snapshot in time is a constant for a given query X, which leaves TF as the only variable in the equation for any particular query. TF is the number of times a term appears on a page divided by the total number of words on that page. Yet the total number of words on a page seems irrelevant, since I have a page with over 400 words ranking above a page with 66 words, with all else (except term frequency) being equal. In other words, tf-idf basically boils down to keyword frequency (correct me if I'm wrong).

For example, the idf for "SEO" would be 176,000,000 / 14,480,000,000.

Whatever that resolves to be is used as IDF to calculate the tf-idf value for, say, document A and B. Document A has 100 words total, with the word "SEO" appearing 3 times (TF = 3/100). Document B on the other hand has 1000 words total, with the word "SEO" appearing 10 times (TF = 10/1000).

So, the final values would be .......

(3/100) / (176,000,000 / 14,480,000,000) [Document A]
(10/1000) / (176,000,000 / 14,480,000,000) [Document B]

IDF is identical for both documents ranking for "SEO", so it can be ruled out as a constant. That leaves TF as the key variable:

3/100 VS 10/1000 (again, assume all other factors, including inbound links, for those two documents are equal).

In short, document A should rank higher than document B. However, that's not what I see happening. Document B is outranking document A. What other explanation is there except Google is not factoring in total word count? Inbound links are identical (in other words, identical off-page factors), no keyword in urls, titles, etc. The only two things different about those two pages are total word count and keyword frequency.

That's why I'm leaning toward the conclusion that keyword density is not a factor in Google's algorithm.
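
The comparison can be checked numerically with a short sketch. Note the assumptions: it uses the textbook form IDF = log(D/d) rather than the ratio written above, takes D and d from the figures in this post, and treats "tf as a density" and "tf as a raw count" as two hypothetical scoring rules. Nothing here claims to be what Google actually computes.

```python
import math

D = 14_480_000_000  # total documents (figure from the post)
d = 176_000_000     # documents containing "SEO" (figure from the post)
idf = math.log10(D / d)  # identical for both documents

docs = {"A": (3, 100), "B": (10, 1000)}  # (occurrences, total words)

def score_normalized(occ, words):
    """tf as a density: occurrences divided by page length."""
    return (occ / words) * idf

def score_raw(occ, words):
    """tf as a raw count: page length is ignored."""
    return occ * idf

by_density = max(docs, key=lambda k: score_normalized(*docs[k]))
by_count = max(docs, key=lambda k: score_raw(*docs[k]))
print("winner with length-normalized tf:", by_density)
print("winner with raw-count tf:", by_count)
```

Because the IDF factor is shared, it never changes the order; only the choice of local weight does. Under the density rule document A wins, under the raw-count rule document B wins, which matches the ranking described above.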

QUOTE
The glory days are gone and we now reside in a world where it has some limited value. Concepts such as latent semantic indexing that associates terms with synonyms...


How effective is keyword spamming? More effective than H1 tags, and almost as effective as keywords in the TITLE element. If you're targeting low-hanging fruit, as Matt Cutts said, that's all you really need.

LSI will remain a myth in my mind until proven otherwise.

I admit any tests done on this kind of thing are rudimentary at best, but so far what I'm seeing is less than promising:

http://www.google.com/search?hs=jpx&hl...amp;btnG=Search

Tedster over at wmw had this to say about LSI:

QUOTE
Over the past two years, every time I asked a google engineer about whether they used lsi, they said they did not. Finally I got a bit sharper -- lsi is a specific method. Just because they are not using that specific method (there may even be patent questions involved) doesn't mean that google is not using various forms of semantic analysis. I'd say they definitely are. They've purchased entire companies that specialize in semantics, such as Applied Semantics in 2003.


http://www.webmasterworld.com/google/3085334.htm

This post has been edited by Halfdeck: Oct 3 2006, 01:41 AM

Centenarian Poster

Group: Members
Joined: 5-December 05
Posts: 121
From: UK
post Oct 3 2006, 04:09 AM
Further to the LSI discussion that seems to have wandered into this thread, here is a link to Mike Grehan's recent post on LSI:

http://www.clickz.com/showPage.html?page=3623571

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Oct 3 2006, 09:14 AM
QUOTE
What other explanation is there except Google is not factoring in total word count?


I think that you are right, there. Are there some other things that we should consider, though?

Other things happen to remove words from total word count, like removal of stop words, stemming, tokenization.

What impact does placement and presentation have upon the importance of those words or phrases in determining relevancy?

We could also have something like a block level analysis or visual gap segmentation being used to decide that words in some sections of a page should be given more weight than others.

What other signals might impact a determination of relevancy upon a page?

We don't know what kind of semantic analysis Google might be using, if any. Here's one of the whitepapers from Applied Semantics which talks about one form of technology that they purchased:

CIRCA Technology:
Applying Meaning to Information Management
An Applied Semantics Technical White Paper
http://www.adrenalyn.com.au/circa-semantics-technology.htm




Centenarian Poster

Group: Members
Joined: 5-March 06
Posts: 110
post Oct 3 2006, 02:34 PM
Great points Bill. I admit there may be some factors that I'm completely overlooking.

QUOTE
Other things happen to remove words from total word count, like removal of stop words, stemming, tokenization.


In my case, I doubt stop words or stemming are factors, because I'm using nonsense words (e.g. "Suspendisse at ipsum non nisi varius viverra. Quisque tincidunt adipiscing"). The words "at" and "non" may be dropped, I suppose, but if I add more words to a page that drop can be compensated for. I don't see stemming lowering word count.
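
To make the word-count effect concrete, here is a rough sketch of tokenizing the sample text and dropping stop words. The stop list is purely hypothetical; which words, if any, a search engine actually removes is unknown.

```python
import re

# Hypothetical stop list; which words an engine actually drops is unknown.
STOP_WORDS = {"at", "non"}

sample = "Suspendisse at ipsum non nisi varius viverra. Quisque tincidunt adipiscing"

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

tokens = tokenize(sample)
kept = [t for t in tokens if t not in STOP_WORDS]
print(f"{len(tokens)} tokens before filtering, {len(kept)} after")
```

If the denominator used for density were the filtered count rather than the raw count, the effective density of a test page would be higher than a simple word count suggests.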

QUOTE
What impact does placement and presentation have upon the importance of those words or phrases in determining relevancy?


Both pages have the exact same format and the same HTML layout: one or two paragraphs of text, no H tags. So imo that rules out presentation as a possible factor. The keyword is scattered randomly (and evenly) throughout a sentence or a paragraph. For example:

http://64.233.187.104/search?q=cache:b26Eq...t=clnk&cd=5

Other test pages also tell me keyword placement within a sentence/paragraph doesn't seem to make a noticeable impact on ranking.

QUOTE
We could also have something like a block level analysis or visual gap segmentation being used to decide that words in some sections of a page should be given more weight than others.


I'm only dealing with one or two paragraphs of text here (no nav menus, tables, divs).

I still *believe* the major factor in play here is keyword frequency. If on-page factors were completely ineffective, I would think the search engines would be spam-free. I don't think these SERPs happen by accident:

Yahoo:

http://search.yahoo.com/search?p=brandnews...p-rd&dups=1

1. Home page
2. The "SEO Combined" page (keyword in anchor text linking to the page, keyword in url, keyword in title, and keyword repeated 3 times on the page, twice in H tags) - ranks high on all major engines.
3. Keyword repeated 11x
4. Keyword in H1 positioned top of page
5. Keyword in META keyword tag
6. Keyword in H1 (bottom of page)
7. Keyword repeated 9x
19. Keyword repeated 10x (??)

MSN:

http://search.msn.com/results.aspx?q=site%...0&go=Search

1. Keyword in TITLE
2. Keyword repeated 11x (66 words)
3. SEO Combined
...
5. Home page
6. Keyword repeated 10x (57 words)
7. Keyword repeated 9x (57 words)
....
9. Keyword repeated 12x (446 words)

Google:

1. SEO Combined
2. Home page
3. Keyword in TITLE
4. Keyword repeated 12x
5. Keyword repeated 11x
6. Keyword repeated 10x
7. Keyword repeated 9x

Coincidence? Perhaps. Oversimplified? Likely.

Here's a recent post on ihelpyou forum by Graywolf:

QUOTE

Doug man you crack me up, that's really not keyword stuffing. Now IMHO this is an example of keyword stuffing and spamming h**p://www.wolf-howl.com/seo/aequeosalinocalcalinosetaceoaluminosocupreovitriolic/

not linked on purpose cause I know you don't like it when I do that. However even with insanely high keyword density the page still ranks.

h**p://www.google.com/search?q=Aequeosalinocalcalinosetaceoaluminosocupreovitriolic

kinda funny when stuff doesn't work the way google tells you it does isn't it


http://72.14.209.104/search?q=cache:Pziz8m...lient=firefox-a

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Oct 3 2006, 08:54 PM
I'm enjoying your experiments, halfdeck.

QUOTE
In my case, I doubt stop words or stemming are factors, because I'm using nonsense words (e.g. "Suspendisse at ipsum non nisi varius viverra. Quisque tincidunt adipiscing")


I wonder if a number of those are being filtered in some way.

In this older paper, The Term Vector Database: fast access to indexing terms for Web pages, they talk about calculating term weights, and removing some of the terms that occur least frequently on the web:

QUOTE
We eliminate the least frequent third because they are noisy and do not provide a good basis for measuring semantic similarity. For example, one such term is hte, a misspelling of the. This term appears in a handful of pages that are completely unrelated semantically. However, because this term is so infrequent, its appearance in term vectors makes those vectors appear to be quite closely related.


Might that have some meaning for your experiments? Don't know if it does, but thought it was worth pointing to.
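
A rough sketch of that kind of filtering, under loud assumptions: the paper measures term frequency across the web, whereas this toy version uses document frequency within a three-page corpus, and ties between equally rare terms are broken alphabetically, which is arbitrary.

```python
from collections import Counter

def drop_rarest_third(docs):
    """Remove the third of the vocabulary with the lowest document frequency."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    # Sort terms from rarest to most common; ties broken alphabetically.
    ranked = sorted(df, key=lambda t: (df[t], t))
    rare = set(ranked[: len(ranked) // 3])
    return [[t for t in tokens if t not in rare] for tokens in docs]

docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "hte cat ran".split(),
]
print(drop_rarest_third(docs))
```

If made-up or very rare words on a test page were discarded this way, they would not count toward the page's total word count, which would change the density seen by the engine.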

Centenarian Poster

Group: Members
Joined: 5-March 06
Posts: 110
post Oct 5 2006, 04:21 AM
QUOTE
I wonder if a number of those are being filtered in some way.


You got me there, Bill. Thanks for the input as well; you loosened my brain up a bit :)

Hall of Famer

Group: Hall Of Fame
Joined: 3-November 05
Posts: 3,461
From: CHeeseland
post Oct 5 2006, 04:41 AM
Halfdeck, I tend to use existing English texts for most of my tests - and just sprinkle my test keywords in randomly. I pick up a few works by the same author at http://www.gutenberg.org/, use a tool to extract sentences (and clean up things that I don't need: dates, times, abbreviations, etc.) and randomly mix them together to create pseudo-English pages like these. It might also prevent the filtering which Bill mentioned.

John
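
A minimal sketch of the approach described in this post, not the actual tool used. The function name, the sample sentences and the keyword `examplekeyword` are all hypothetical, and the sentence splitter is deliberately naive (it does not handle abbreviations).

```python
import random
import re

def make_test_page(source_text, keyword, repeats, n_sentences, seed=0):
    """Shuffle sentences from source_text and insert keyword `repeats` times."""
    rng = random.Random(seed)  # fixed seed so the page is reproducible
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", source_text) if s.strip()]
    chosen = rng.sample(sentences, min(n_sentences, len(sentences)))
    words = " ".join(chosen).split()
    for _ in range(repeats):
        # Drop the keyword at a random word position.
        words.insert(rng.randrange(len(words) + 1), keyword)
    return " ".join(words)

source = "It was a dark night. The rain fell. He walked home. She read a book."
page = make_test_page(source, "examplekeyword", repeats=3, n_sentences=3)
print(page)
```

Varying `repeats` while keeping `n_sentences` fixed changes frequency and density together; varying `n_sentences` while keeping `repeats` fixed changes density alone.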

Centenarian Poster

Group: Members
Joined: 5-March 06
Posts: 110
post Oct 6 2006, 02:38 AM
Thanks Softplus, I might give that a try.

Member

Group: Members
Joined: 30-June 05
Posts: 38
post Oct 8 2006, 09:55 AM
We have addressed this in a recent blog. Regarding the notion that search engines define term weights using w = tf*IDF: Why stick to this expression and assume that current search engines like Google, Yahoo or MSN use this?

For those not familiar with tf*IDF,

tf = term frequency
IDF = log(D/d)

where D is total number of documents in an index and d is number of documents containing the term in question.

Note IDF is a log scale at a given base b, where b can be 10, 2, etc. In many textbooks b is assumed to be 10, so IDF is presented as a base-10 log scale. However, some research papers discuss binary models and use a base-2 scale. Logs are used simply because they are additive and simplify the comparison between large and small numbers.
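
As a quick illustration of the base choice, the sketch below computes IDF = log(D/d) in base 10 and base 2. The collection sizes are made up. Changing the base multiplies every IDF by the same constant, so it rescales weights without reordering them.

```python
import math

def idf(D, d, base=10):
    """IDF = log_base(D / d), with D total documents and d containing the term."""
    return math.log(D / d, base)

D, d = 1_000_000, 1_000  # illustrative collection sizes
idf10 = idf(D, d, 10)
idf2 = idf(D, d, 2)
print(idf10, idf2, idf2 / idf10)  # the ratio is log2(10) for any D and d
```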

Also local weights can be modified, so the well known local weight scheme, i.e., L = tf, is just one of many local weight schemes. Believe me, there are many ways of computing local weights other than just a direct mapping of the form L = tf.

In issue 2 of our IRW newsletter we mentioned that in 1999 Erica Chisholm and Tamara G. Kolda from Oak Ridge National Labs (ORNL) reviewed several term weight schemes in New Term Weighting Formulas for the Vector Space Method in Information Retrieval. Additional weighting formulas have been published since then. All of these accommodate defining term weights as

w = L*G*N
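
Here L is a local weight, G a global weight and N a normalization factor. The sketch below shows one possible combination among the many discussed in this post: a log-dampened local weight, IDF as the global weight and cosine normalization. These particular choices, the function name and the sample counts are assumptions made for illustration only.

```python
import math

def term_weights(doc_tf, df, D):
    """Weights w = L * G * N for one document.

    doc_tf: {term: occurrences in this document}
    df:     {term: number of documents containing the term}
    D:      total number of documents
    """
    # L: dampened local weight; G: IDF global weight (illustrative choices).
    unnormalized = {
        t: math.log(1 + tf) * math.log(D / df[t]) for t, tf in doc_tf.items()
    }
    # N: cosine normalization so the document vector has unit length.
    length = math.sqrt(sum(v * v for v in unnormalized.values()))
    n = 1 / length if length else 0.0
    return {t: v * n for t, v in unnormalized.items()}

weights = term_weights({"seo": 3, "google": 1}, {"seo": 10, "google": 50}, D=100)
print(weights)
```

Swapping in a different L, G or N changes the resulting weights, which is why assuming plain tf*IDF can be misleading.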

Regarding tf*IDF as given above, this was used in the Classic Term Vector Model from the seventies and eighties. The expression has several limitations. To mention a few, the formula ignores

1. negative weights
2. entropy weights (E)
3. normalization weights (N)
4. link weights

The expression ignores the relative position of terms with other terms, term ordering, contextuality, and many other things that can be incorporated into the notion of relevancy, similarity and relatedness. The equation also ignores how other documents in the collection are inducing similarity to a given document or whether such effect is negative or positive in nature.

Regarding 1, negative weights can be accounted for with, for example, a probabilistic model. To mention just one:

w = tf*log((D - d)/d)

This expression is also considered part of the family of tf*IDFs. In fact, in the past some authors have referred to it as IDF=log((D - d)/d), too. The expression will change from positive to negative whenever d is greater than 50% of D. This means that valid terms can have negative weights. This is purely a mathematical effect, not a particular weighting mechanism or filtering from a search engine. However, this leads to interesting retrieval and scoring (ranking) complications.
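
The sign change is easy to see with a few values. In this small sketch the collection size D = 1000 and the three values of d are illustrative only.

```python
import math

def prob_weight(tf, D, d):
    """w = tf * log((D - d) / d)."""
    return tf * math.log((D - d) / d)

D = 1000
for d in (100, 500, 900):
    # Positive below 50% of D, zero at exactly 50%, negative above it.
    print(d, prob_weight(1, D, d))
```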

Regarding 2, entropy weights can be incorporated by adding expressions of the form p ln(p), where p is the term probability.
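
One common formulation of an entropy-based global weight, offered here as an assumption rather than as the specific scheme the post has in mind, is 1 + sum(p ln p) / ln(n), where p is the share of a term's occurrences falling in each of n documents.

```python
import math

def entropy_weight(counts):
    """Global weight 1 + sum(p * ln p) / ln(n) over n documents.

    counts: occurrences of one term in each of the n documents.
    """
    n = len(counts)
    total = sum(counts)
    if n < 2 or total == 0:
        return 0.0  # weight is undefined here; 0.0 is an arbitrary fallback
    s = sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return 1 + s / math.log(n)

print(entropy_weight([5, 0, 0, 0]))  # term concentrated in one document
print(entropy_weight([2, 2, 2, 2]))  # term spread evenly across documents
```

A term concentrated in a single document gets the maximum weight, while a term spread evenly across all documents gets a weight close to zero.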

Regarding 3, normalization can be incorporated by normalizing the length of documents. This can be done by taking crude ratios or by using document "pivoted" normalization. All this is discussed in the ORNL paper.

To top it off, global weights can be made recursive and adaptable to a query or geolocation, so they are not just a mere constant. Relevance feedback can be incorporated in the background so the initial query can be expanded to discover and append new documents.

LSI can be triggered as an ancillary mechanism to do concept matching and expand answer sets. The final answer set can be the result of reranking and merging previous sets (fusion) or of purging. All this can be transparent to end users sitting at the end of a search box.

All this suggests that we need to think in terms of the co-retrieval power of words, rather than on mere term-to-term matching. I hope to put out a piece on this soon and to show how even in these cases c-index calculations can be used to address co-retrieval.

Cheers
Dr. Garcia

PS. I have corrected one expression and a few typos.

This post has been edited by orion: Oct 8 2006, 11:10 AM

Member

Group: Members
Joined: 14-July 06
Posts: 20
post Oct 11 2006, 01:50 PM
Wowza, my Catholic school math is failing me on a lot of cognitive levels with your reply, Dr. Garcia. Could you rephrase as if explaining this to a drama major? :)

Many thanks.

Member

Group: Members
Joined: 30-June 05
Posts: 38
post Oct 15 2006, 10:14 AM
I'll be happy to. Could you be a bit more specific?

Meanwhile I would say this: the idea of trying to apply the IDF concept introduced by Karen Sparck-Jones back in 1972 to an IR problem from 2006 and assume that search engines are sticking to that model to score weights is contraindicated.

The fact is that the tf*IDF scheme is just one of many term weight schemes and the simplest one that incorporates local and global weights. While the IDF concept is an IR keystone concept, tf*IDF is just one way to look at weights of terms. There are many better schemes that incorporate IDF or derivatives of it, believe me. The primitive tf*IDF model is taught at CS schools as introductory material for advanced concepts since it has several limitations (some described in my previous post).

Regarding the origin of the IDF concept: it is derived from Zipf's Law. It was introduced for the very first time in 1972 by Sparck-Jones in the Journal of Documentation, in a paper called

A statistical interpretation of term specificity and its application in retrieval

More on this here. I highly recommend that members of this forum read the IDF Page, which is a tribute to Sparck-Jones's brilliant work.

This post has been edited by orion: Oct 15 2006, 10:19 AM