Cre8asiteforums Internet Marketing
and Conversion Web Design


keyword density


35 replies to this topic

#1 Cathy Jones

Cathy Jones

    Unlurked Energy

  • Members
  • 7 posts

Posted 12 July 2006 - 01:29 AM

What should be the exact keyword density in a website?

#2 lee.n3o

lee.n3o

    Cre8asite Tech News Reporter

  • 1000 Post Club
  • 1556 posts

Posted 12 July 2006 - 03:02 AM

Well, that's a hard one... I don't think anyone can really give a definitive answer on that. I read this the other day and think it's a great bit of advice:

I can't specifically give an optimal keyword density in an article as there's a good chance it will be read months from now when the densities are different. The ideal way to calculate optimal densities is to figure out what the densities of the current top 10 are and target appropriately


Hope that helps
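
If you want to run that check yourself, density is just occurrences divided by total words. Here's a minimal Python sketch of the idea (assuming you already have the page text as a plain string; the function name is only for illustration):

import re

def keyword_density(text, keyword):
    # Density as a percentage: occurrences of the keyword over total word count.
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return 100.0 * words.count(keyword.lower()) / len(words)

print(keyword_density("Hotels in New York. New York hotels near the park.", "hotels"))  # 20.0

Run that over the current top 10 for your term and you get a rough target range, which is all the quoted advice is really asking for.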

#3 A.N.Onym

A.N.Onym

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 4003 posts

Posted 12 July 2006 - 04:06 AM

I am sure the topic has been covered before, but using Google didn't reveal anything solid.

Anyway, keyword density is for the machinery. You want to write content for your visitors.
That is, you write whatever you want for your visitors without taking keywords into account, then replace meaningless words (pronouns - it, that, who, etc) with your keywords. This way your website copy retains the original flow and has the right keywords.

Special attention should be paid to human-friendliness, so to speak. Make sure you can read the text without noticing it was written for the search engines.

Keyword density handled this way should keep your visitors interested, as well as the search engines.

#4 FP_Guy

FP_Guy

    Mach 1 Member

  • 250 Posts Club
  • 413 posts

Posted 12 July 2006 - 09:13 AM

Keyword density still has an effect in MSN and Yahoo, but that's no longer true in Google. I've seen rankings where the keyword was only mentioned twice: once in the title tag and once in the alt tag.

Research on Yahoo's keyword density done in April 2006 showed that the top listings always fell between 1% and 8%. Anything more was in danger of setting off spam filters.

Research on MSN's keyword density done at about the same time showed that the top listings always fell between 3.5% and 4%. Falling outside of that range was forgiven, but not ignored.

#5 Respree

Respree

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 5901 posts

Posted 12 July 2006 - 09:28 AM

I wouldn't pay too much attention to keyword density.
http://www.e-marketi...r05/garcia.html

http://www.cre8asite...showtopic=30722

#6 bwelford

bwelford

    Peacekeeper Administrator

  • Site Administrators
  • 9008 posts

Posted 12 July 2006 - 09:41 AM

In addition, Google and possibly the other two biggies do semantic analysis. So they'll take account of synonyms of the keywords being searched for. There's no easy way for you to take that into your metrics.

#7 cre8pc

cre8pc

    Dream Catcher Forums Founder

  • Admin - Top Level
  • 13517 posts

Posted 12 July 2006 - 10:23 AM

And remember: when keyword density becomes the goal, it usually means the page will be frustrating to humans and especially infuriating to anyone with a screen reader.

#8 phaithful

phaithful

    Light Speed Member

  • Members
  • 800 posts

Posted 12 July 2006 - 11:05 AM

I still believe keyword density to be one of the ranking factors, but it's changed from just a simple percentage to more of what Barry is talking about:

Google and possibly the other two biggies do semantic analysis. So they'll take account of synonyms of the keywords being searched for.


Keyword density is really more of a metric for you to understand how to write more naturally for the user.

If you're looking for a number, run some keyword density reports on your competitors or on well-written articles. Then try to emulate that number... but remember you've got to emulate all the numbers for all the semantically related keywords...

You're better off just writing a good article... use the above only if you know you're a bad writer or can't afford a good one :)

#9 send2paul

send2paul

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 2905 posts

Posted 12 July 2006 - 03:17 PM

Cathy - HI :)

Two keyword density tools:

1. Keyword Density and Position
2. Keyword Density Analyzer.

Use them as guided in the above posts, remembering that "normal" folks are the ones that count - not website spiders ;)

Paul

#10 kestrel

kestrel

    Mach 1 Member

  • Members
  • 477 posts

Posted 13 July 2006 - 03:12 AM

A piece of advice I was once given was to look at the top 10 results for your search term and measure their keyword density. You should then make yours the same or a touch higher.

This can all be ignored of course if you get enough inbounds.

K

#11 A.N.Onym

A.N.Onym

    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 4003 posts

Posted 13 July 2006 - 03:33 AM

Sorry to disappoint you all, but keyword density is considered a myth.

Some brain-picking proof by Dr. Garcia - Keyword Density of Non-sense.

To sum up, the assumption that KD values could be taken for estimates of term weights or that these values could be used for optimization purposes amounts to the Keyword Density of Non-Sense.


That being said, I'll happily dive into another school of thought if someone shows some real, repeatable results from after the Big Daddy update.

Edited by A.N.Onym, 13 July 2006 - 03:40 AM.


#12 bwstyle

bwstyle

    Ready To Fly Member

  • Members
  • 13 posts

Posted 13 July 2006 - 10:04 AM

Hey guys, KW density is most definitely a myth. I get some decent rankings and haven't even thought about KW density in years. There are so many other GOOD uses of your time that I would hate to see you spend anything more than a minute on density analysis.

#13 Halfdeck

Halfdeck

    Gravity Master Member

  • Members
  • 110 posts

Posted 13 July 2006 - 12:33 PM

According to a few of my test pages, high keyword density seems to still matter:

http://www.google.co...amp;btnG=Search

That being said, links trump on-page factors unless you're in a niche where no one is linking to each other.

#14 Black_Knight

Black_Knight

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9339 posts

Posted 13 July 2006 - 01:06 PM

I recently did some testing in a highly competitive keyword area, making huge reductions to keyword density, with no adverse effects at all to be seen. The site I'm testing with does not have huge numbers of links either.

However, it does have good keywords in the links it does have, and while those links are not especially high in PageRank (and most of the competition have more links) the links I do have are on quality sites that are spot-on relevant to the exact same keywords.

SEO just isn't the simple numbers game it used to be unless you talk about vast numbers. These days the numbers in the game are in terms of advanced mathematics used to measure quality and co-citation relevancy.

Afterthought: The beauty of such relevancy measures is that they will always defeat nonsensical and made-up words. Testing with nonsense phrases cannot test the performance of co-citation very easily, and cannot test the performance of relevancy, theme, or hubs/authorities at all. It really is quite beautiful in the way it makes most simplistic testing entirely self-deceptive with no effort at all.


#15 ukdaz

ukdaz

    Light Speed Member

  • Members
  • 738 posts

Posted 13 July 2006 - 05:14 PM

Been offline for a week waiting for BT to get my broadband line in... I have moved from Cambridgeshire to Wiltshire (UK), so I've been very frustrated waiting for an internet connection...

OK back on topic...

I tend not to consider keyword density UNLESS my content seems to be quite repetitive when I read it back. Try using variations of similar keyword phrases (consider looking up semantics on Google) and utilise those in your content rather than aiming for a certain percentage of keyword density - I'd say KW density used to work, but with the development of s/e algorithms...

Not only will the search engines prefer your content, more importantly so will your visitors for having interesting & refreshing content!

Daz

#16 Halfdeck

Halfdeck

    Gravity Master Member

  • Members
  • 110 posts

Posted 14 July 2006 - 11:12 AM

I recently did some testing in a highly competitive keyword area


That's not surprising. If you're "testing" pages in a competitive niche (which isn't really a test in the first place, since you've got too many factors going on at once), I don't expect on-page tinkering to make a dent in the rankings. SEO for Google nowadays is all about links. It comes down to building valuable websites and increasing site visibility. Keyword density has no meaningful place in that process.

Testing with nonsense phrases cannot test the performance of co-citation very easily, and cannot test the performance of relevancy, theme, or hubs/authorities at all.


I agree there are obvious limitations to nonsense phrases - but then again, the results would be the same if the pages were in English. They're not testing co-citation, relevancy, theme, or anything else. Those particular pages isolate how Google reacts to on-page optimization. What do the results tell me? If you're really worried about on-page optimization, tweak title tags and stuff your articles with keywords - imo a complete waste of time.

Any boost gained by higher keyword density is like a drop in the bucket compared to one link from a relevant, quality site. Keyword density is not a myth, but like all other textbook SEO tactics, the boost you get from on-page tinkering is minimal - unless you're trying to rank higher on MSN or building bottom fishing spam.

#17 yannis

yannis

    Sonic Boom Member

  • 1000 Post Club
  • 1634 posts

Posted 14 July 2006 - 12:16 PM

There are three myths here:

(01) You cannot write good copy if you have a high density of keywords. Yes you can.

(02) Keyword Density does not count. Yes it counts although not in a simplistic way.

(03) A high keyword density can hurt. Only if it is very excessive and repetitive. Keywords with a good distribution up to 13% will not hurt you. Just do not stuff them next to each other.

A little experiment with a keyword density analyzer for "new york hotels" can substantiate most of the above statements.

The average keyword density for "hotel" was 4.7%, with the highest being 13.04%. "York" varied from 3.69% to 11.67%. If one counts semantically similar words, the percentages are much higher. Most of the ten websites in the above SERPs are old websites with high PageRank and a high volume of quality backlinks.

Although I agree that one can get a good ranking without the words even appearing in the text, we cannot say that a bit of extra on-page optimization would have hurt! Since, used wisely, it cannot hurt, I would recommend the following:

(01) Keyword density of about 4.5%.
(02) Add at least 2.0% semantically similar words.

In the early days of a website the keywords can give you a bit of an edge.

Yannis

PS Normally, what I do is just write the copy without any thought for keywords. I then test the keyword density and add sentences containing the keywords until I get close to 4-5%.
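
If you like to automate that last step, here is a rough sketch of the same idea in Python. The 4.5% target, the assumed sentence length and the assumption of one keyword per added sentence are just placeholders for illustration, not a rule:

def sentences_needed(word_count, keyword_count, target=0.045, words_per_sentence=10):
    # Estimate how many extra sentences bring density up to the target,
    # assuming each added sentence is ~10 words long and mentions the keyword once.
    # Note: target must stay below 1/words_per_sentence or the loop never ends.
    added = 0
    while (keyword_count + added) / (word_count + added * words_per_sentence) < target:
        added += 1
    return added

# e.g. a 600-word draft that mentions the keyword 8 times (8/600 = 1.3%)
print(sentences_needed(600, 8))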

#18 phaithful

phaithful

    Light Speed Member

  • Members
  • 800 posts

Posted 14 July 2006 - 12:32 PM

Very nice write-up, Yannis. I agree:

Yes it counts although not in a simplistic way.


I still believe keyword density is one of those ranking factors, but since it's been exploited in the past it carries much, much less value - though it's not dead altogether.

#19 Black_Knight

Black_Knight

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9339 posts

Posted 14 July 2006 - 03:00 PM

Yannis, I won't say this in a way that allows everyone to see it, so instead I'm going to put this behind an IQ test ;)

Looking at the highest ranking pages for an awful lot of SERPs will show you that the robots meta and the revisit-after tags are both highly effective. When you see why, you'll know that using a KWD analyser on pages probably built by people using a KWD analyser is a self-fulfilling prophecy as sure as that of any classic Greek tragedy.

#20 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 14 July 2006 - 05:31 PM

What should be the exact keyword density in a website?


I'm not sure that keyword density is an approach that many search engines for the web have ever used to create their indices, but term weight may be what you are asking about.

These are fairly simple explanations of term weights compared to some of the other documents on the subject, but they aren't bad. The first one is jointly authored by people from Compaq, Google, and Altavista.

From The Term Vector Database: fast access to indexing terms for Web pages

How a Search Engine Works

#21 Halfdeck

Halfdeck

    Gravity Master Member

  • Members
  • 110 posts

Posted 29 September 2006 - 08:11 AM

I'm fairly convinced now that it's keyword frequency, not density, that "matters" (for lack of a better word).

http://www.google.co...amp;btnG=Search

Screenshot here: http://www.seo4fun.c...ency-image.html

As you can see, the pages rank in order by the number of times the keyword is repeated (9x to 12x). Besides page size (number of words on a page) and keyword frequency, everything about those pages is identical, including PageRank.

I inflated the word count on the top-ranking page (keyword repeated 12 times) to decrease its keyword density. The lower density did not get in the way of the page ranking high. Based on this, it seems to me Google is ranking those pages by frequency, not density.

#22 Respree

Respree

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 5901 posts

Posted 29 September 2006 - 10:02 AM

I'm not so sure I'm convinced by your findings - at least not yet.

In your four scenarios, you repeat the keyword 12, 11, 10 & 9 times and ranking follows that pattern.

The density is 3%, 18%, 17% and 16%, respectively. Is it possible that Google simply prefers lower density (rankings 2 through 4 are pretty close in density)?

It seems to me you ought to 'clone' your number 4 ranked page, then add the keyword 4 more times. That would result in a density of 21% and a total of 13 occurrences. If it (new scenario #5) pops to the top, then I'd agree with you that ranking is influenced by keyword occurrences and not density.

Edited by Respree, 29 September 2006 - 10:03 AM.


#23 marianne

marianne

    Ready To Fly Member

  • Members
  • 20 posts

Posted 29 September 2006 - 11:38 AM

I'm going to make use of my student loan payment today and put that library/info science degree to some use and hope that I do not embarrass any of my professors in the process. :)

All search engines work the same. They pattern match a query term against their index of pages and then sort the pulled references according to a relevancy measure. Term frequency is a measure of relevance although it has fallen out of favor as search technology has gotten more sophisticated. If other relevance factors outweigh frequency of use, the page will not display high in the results.

In the glory days before keyword stuffing, query term frequency was a very significant factor in relevance ranking. The glory days are gone and we now reside in a world where it has some limited value. Concepts such as latent semantic indexing that associates terms with synonyms and Term Frequency/Inverse Document Frequency [tf-idf] provide a better determination of relevance to the user's query than how often a word is used on the page [IMHO, of course]. I wish that I had sat next to bragadocchio in class, as I would likely be able to better explain the concepts.
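
In code terms, that basic loop is tiny. A toy sketch with made-up pages, using raw term frequency as the old-fashioned relevance measure (nothing any real engine actually ships):

from collections import Counter

pages = {
    "page-a": "new york hotels and more new york hotels",
    "page-b": "hotels guide for travellers visiting new york",
}

# Build a trivial inverted index: term -> set of page ids
index = {}
for pid, text in pages.items():
    for term in text.split():
        index.setdefault(term, set()).add(pid)

def search(term):
    # Pull the matching pages, then sort them by a relevance measure -
    # here just how often the term appears on each page.
    matches = index.get(term, set())
    scored = [(Counter(pages[pid].split())[term], pid) for pid in matches]
    return [pid for score, pid in sorted(scored, reverse=True)]

print(search("hotels"))  # ['page-a', 'page-b'] - page-a uses the term twice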

Edited by marianne, 29 September 2006 - 11:40 AM.


#24 Halfdeck

Halfdeck

    Gravity Master Member

  • Members
  • 110 posts

Posted 02 October 2006 - 01:25 PM

It seems to me you ought to 'clone' your number 4 ranked page, then add the keyword 4 more times. That would result in a density of 21% and a total of 13 occurrences. If it (new scenario #5) pops to the top, then I'd agree with you that ranking is influenced by keyword occurrences and not density.


Respree, thanks for the input. I may try that.

#25 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 02 October 2006 - 08:15 PM

I wish that I had sat next to bragadocchio in class as I would likely be able to better explain the concepts.


A library/info science degree probably would have been fun. Been thinking about some computer science classes, but that might be just as interesting.

Frequency rather than density is probably what you are looking for here, with local frequency over global frequency (the use of the word or phrase over the rest of the documents on the web) or, as marianne put it, Frequency/Inverse Document Frequency [tf-idf].

Of course, other things likely come into play, such as the effect of anchor text pointing to a page. Other signals are impactful as well, which makes attempting to reverse hack and measure the impact of frequency rates pretty much an impossible proposition.

#26 Halfdeck

Halfdeck

    Gravity Master Member

  • Members
  • 110 posts

Posted 03 October 2006 - 01:36 AM

I'm going to make use of my student loan payment today and put that library/info science degree to some use


Marianne, my comp sci degree has got to be the worst investment in my life :(

Of course, other things likely come into play, such as the effect of anchor text pointing to a page.


Bill, first off, I'm in no way advocating keyword spamming. Compared to off-page optimization, I don't expect that repeating a keyword on a page 200 times will carry a page trying to rank for "real estate" very far. I think that was the point of Marianne's post, which I agree with: 1) make a site crawlable, 2) create good content, 3) increase visibility. In short, don't get hung up on old "optimization" techniques.

However, let's not forget Matt Cutts' recent blog article, http://www.mattcutts...ders-will-love/, in which he says:

Notice what I did with keywords. I carefully chose keywords for the title and the url (note that I used “change” in the url and “changing” in the title). The categories on my post (”How to” and “Linux”) give me a subtle way to mention Linux again, and include a couple extra ways that someone might do a search–lots of user type “how to (do what they want to do).” I thought about the words that a user would type in when looking for an answer to their question, and tried to include those words in the article. I also tried to think of a few word variations and included them where they made sense (file vs. files, bash and bashrc, Firefox and Mozilla, etc.). I’m targetting a long-tail concept where someone will be typing several words, so I’m probably in a space where on-page keywords are enough to rank pretty well.


---

Other signals are impactful as well, which makes attempting to reverse hack and measure the impact of frequency rates pretty much an impossible proposition.


If you're comparing two regular websites, that would be true. But in a controlled environment, where you can rule out all other signals except the one factor you're interested in, those other signals - I would argue - are not impactful.

--

Frequency rather than density is probably what you are looking for here, with local frequency over global frequency (the use of the word or phrase over the rest of the documents on the web) or, as marianne put it, Frequency/Inverse Document Frequency [tf-idf].


Bill, I had to read that over about four times before it made sense to me :)

I'd assume Inverse Document Frequency at a snapshot in time is a constant for a given query X, so that leaves TF as the only variable in the equation for any particular query. TF is the number of times a term appears on a page divided by the total number of words on the page. In that case, the total number of words on a page seems irrelevant in practice, since I have a page with over 400 words ranking over a page with 66 words, with all else (except term frequency) being equal. In other words, tf-idf basically boils down to keyword frequency (correct me if I'm wrong).

For example, the idf for "SEO" would be log(14,480,000,000 / 176,000,000) - the log of the total number of documents over the number of documents containing the term.

Whatever that resolves to is used as the IDF to calculate the tf-idf value for, say, documents A and B. Document A has 100 words total, with the word "SEO" appearing 3 times (TF = 3/100). Document B, on the other hand, has 1000 words total, with the word "SEO" appearing 10 times (TF = 10/1000).

So the final values would be:

(3/100) x log(14,480,000,000 / 176,000,000) [Document A]
(10/1000) x log(14,480,000,000 / 176,000,000) [Document B]

IDF is identical for both documents ranking for "SEO", so it can be factored out as a constant. That leaves TF as the key variable:

3/100 vs. 10/1000 (again, assume all other factors, including inbound links, are equal for those two documents).

In short, document A should rank higher than document B. However, that's not what I see happening. Document B is outranking document A. What other explanation is there except that Google is not factoring in total word count? Inbound links are identical (in other words, identical off-page factors), no keyword in the urls, titles, etc. The only two things different about those two pages are total word count and keyword frequency.

That's why I'm leaning toward the conclusion that keyword density is not a factor in Google's algorithm.
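
Just to make the arithmetic concrete, here is the textbook calculation in Python for those two hypothetical documents. The D and d figures are the rough numbers from above, base-10 logs are an arbitrary choice, and this is only the classic formula - no claim about what Google actually computes:

import math

D = 14_480_000_000  # total documents in the index (rough figure from above)
d = 176_000_000     # documents containing "SEO"
idf = math.log10(D / d)  # the same constant for every document matching this query

def tf_idf(term_count, total_words):
    tf = term_count / total_words
    return tf * idf

print(tf_idf(3, 100))    # Document A: tf = 0.03
print(tf_idf(10, 1000))  # Document B: tf = 0.01, so A scores higher on paper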

The glory days are gone and we now reside in a world where it has some limited value. Concepts such as latent semantic indexing that associates terms with synonyms...


How effective is keyword spamming? More effective than H1 tags, and almost as effective as keywords in the TITLE element. If you're targeting low-hanging fruit, as Matt Cutts said, that's all you really need.

LSI will remain a myth in my mind until proven otherwise.

I admit any tests done on this kind of thing are rudimentary at best, but so far what I'm seeing is less than promising:

http://www.google.co...amp;btnG=Search

Tedster over at wmw had this to say about LSI:

Over the past two years, every time I asked a Google engineer whether they used LSI, they said they did not. Finally I got a bit sharper -- LSI is a specific method. Just because they are not using that specific method (there may even be patent questions involved) doesn't mean that Google is not using various forms of semantic analysis. I'd say they definitely are. They've purchased entire companies that specialize in semantics, such as Applied Semantics in 2003.


http://www.webmaster...gle/3085334.htm

Edited by Halfdeck, 03 October 2006 - 01:41 AM.


#27 egain

egain

    Gravity Master Member

  • Members
  • 121 posts

Posted 03 October 2006 - 04:09 AM

Further to the LSI discussion that seems to have wandered into this conversation, here's a link to Mike Grehan's recent post on LSI:

http://www.clickz.co...ml?page=3623571

#28 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 03 October 2006 - 09:14 AM

What other explanation is there except Google is not factoring in total word count?


I think that you are right there. Are there some other things that we should consider, though?

Other things happen to remove words from total word count, like removal of stop words, stemming, tokenization.

What impact does placement and presentation have upon the importance of those words or phrases in determining relevancy?

We could also have something like block-level analysis or visual gap segmentation being used to decide that words in some sections of a page should be given more weight than others.

What other signals might impact a determination of relevancy upon a page?

We don't know what kind of semantic analysis Google might be using, if any. Here's one of the whitepapers from Applied Semantics which talks about one form of technology that they purchased:

CIRCA Technology:
Applying Meaning to Information Management
An Applied Semantics Technical White Paper
http://www.adrenalyn...-technology.htm

#29 Halfdeck

Halfdeck

    Gravity Master Member

  • Members
  • 110 posts

Posted 03 October 2006 - 02:34 PM

Great points Bill. I admit there may be some factors that I'm completely overlooking.

Other things happen to remove words from total word count, like removal of stop words, stemming, tokenization.


In my case, I doubt stop words or stemming are factors, because I'm using nonsense words (e.g. "Suspendisse at ipsum non nisi varius viverra. Quisque tincidunt adipiscing"). The words "at" and "non" may be dropped, I suppose, but if I add more words to a page that drop can be compensated for. I don't see stemming lowering the word count.

What impact does placement and presentation have upon the importance of those words or phrases in determining relevancy?


Both pages use the exact same format and HTML layout: one or two paragraphs of text, no H tags. So imo that rules out presentation as a possible factor. The keyword is scattered randomly (and evenly) throughout a sentence or a paragraph. For example:

http://64.233.187.10...q...t=clnk&cd=5

Other test pages also tell me keyword placement within a sentence/paragraph doesn't seem to make a noticeable impact on ranking.

We could also have something like a block level analysis or visual gap segmentation being used to decide that words in some sections of a page should be given more weight that others.


I'm only dealing with one or two paragraphs of text here (no nav menus, tables, divs).

I still *believe* the major factor in play here is keyword frequency. If on-page factors were completely ineffective, I would think the search engines would be spam-free. I don't think these SERPS happen by accident:

Yahoo:

http://search.yahoo....s...p-rd&dups=1

1. Home page
2. The "SEO Combined" page (keyword in anchor text linking to the page, keyword in url, keyword in title, and keyword repeated 3 times on the page, twice in H tags) - ranks high on all major engines.
3. Keyword repeated 11x
4. Keyword in H1 positioned top of page
5. Keyword in META keyword tag
6. Keyword in H1 (bottom of page)
7. Keyword repeated 9x
19. Keyword repeated 10x (??)

http://search.msn.co...%...0&go=Search

MSN:

1. Keyword in TITLE
2. Keyword repeated 11x 66 words
3. SEO Combined
...
5. Home page
6. Keyword repeated 10x 57 words
7. Keyword repeated 9x 57 words
....
9. Keyword repeated 12x 446 words

Google:

1. SEO Combined
2. Home page
3. Keyword in TITLE
4. Keyword repeated 12x
5. Keyword repeated 11x
6. Keyword repeated 10x
7. Keyword repeated 9x

Coincidence? Perhaps. Oversimplified? Likely.

Here's a recent post on ihelpyou forum by Graywolf:

Doug man you crack me up, that's really not keyword stuffing. Now IMHO this is an example of keyword stuffing and spamming h**p://www.wolf-howl.com/seo/aequeosalinocalcalinosetaceoaluminosocupreovitriolic/

not linked on purpose cause I know you don't like it when I do that. However even with insanely high keyword density the page still ranks.

h**p://www.google.com/search?q=Aequeosalinocalcalinosetaceoaluminosocupreovitriolic

kinda funny when stuff doesn't work the way google tells you it does isn't it


http://72.14.209.104...lient=firefox-a

#30 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 03 October 2006 - 08:54 PM

I'm enjoying your experiments, halfdeck.

In my case, I doubt stop words or stemming are factors, because I'm using nonsense words (e.g. "Suspendisse at ipsum non nisi varius viverra. Quisque tincidunt adipiscing")


I wonder if a number of those are being filtered in some way.

In this older paper, The Term Vector Database: fast access to indexing terms for Web pages, they talk about calculating term weights, and removing some of the terms that occur least frequently on the web:

We eliminate the least frequent third because they are noisy and do not provide a good basis for measuring semantic similarity. For example, one such term is hte, a misspelling of the. This term appears in a handful of pages that are completely unrelated semantically. However, because this term is so infrequent, its appearance in term vectors makes those vectors appear to be quite closely related.



Might that have some meaning for your experiments? Don't know if it does, but thought it was worth pointing to.

#31 Halfdeck

Halfdeck

    Gravity Master Member

  • Members
  • 110 posts

Posted 05 October 2006 - 04:21 AM

I wonder if a number of those are being filtered in some way.


You got me there, Bill. Thanks for the input as well; you loosened my brain up a bit :huh:

#32 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 05 October 2006 - 04:41 AM

Halfdeck, I tend to use existing English texts for most of my tests - and just sprinkle my test keywords in randomly. I pick up a few works by the same author at http://www.gutenberg.org/, use a tool to extract sentences (and clean up things that I don't need - dates, times, abbreviations, etc.), and randomly mix them together to create pseudo-English pages like these. It might also prevent the filtering which Bill mentioned.
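
Roughly like this, if anyone wants to try it (just a sketch of the approach, not my actual tool - the file name and the keyword are placeholders):

import random
import re

# Build a pseudo-English test page from a public-domain text, with a test
# keyword sprinkled in at random positions.
with open("gutenberg_book.txt", encoding="utf-8") as f:  # placeholder file name
    raw = f.read()

# Very rough sentence split; a real run would also strip dates, abbreviations, etc.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", raw) if len(s.split()) > 4]
random.shuffle(sentences)

page = sentences[:20]
for _ in range(5):  # sprinkle the test keyword five times
    page.insert(random.randrange(len(page) + 1), "testkeyword.")

print(" ".join(page))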

John

#33 Halfdeck

Halfdeck

    Gravity Master Member

  • Members
  • 110 posts

Posted 06 October 2006 - 02:38 AM

Thanks Softplus, I might give that a try.

#34 Guest_orion_*

Guest_orion_*
  • Guests

Posted 08 October 2006 - 09:55 AM

We have addressed this in a recent blog. Regarding the notion that search engines define term weights using w = tf*IDF: Why stick to this expression and assume that current search engines like Google, Yahoo or MSN use this?

For those not familiar with tf*IDF,

tf = term frequency
IDF = log(D/d)

where D is the total number of documents in the index and d is the number of documents containing the term in question.

Note that IDF is a log scale at a given base b, where b can be 10, 2, etc. In many textbooks b is assumed to be 10, so it is presented as a base-10 log scale. However, some research papers discuss binary models and use a base-2 scale. Logs are used simply because they are additive and simplify the comparison between large and small numbers.

Also, local weights can be modified, so the well-known local weight scheme, i.e., L = tf, is just one of many local weight schemes. Believe me, there are many ways of computing local weights other than a direct mapping of the form L = tf.

In issue 2 of our IRW newsletter we mentioned that in 1999 Erica Chisholm and Tamara G. Kolda from Oak Ridge National Labs (ORNL) reviewed several term weight schemes in New Term Weighting Formulas for the Vector Space Method in Information Retrieval. Additional weighting formulas have been published since then. All of these accommodate defining term weights as

w = L*G*N

Regarding tf*IDF as given above, this was used in the Classic Term Vector Model from the seventies and eighties. The expression has several limitations. To mention a few, the formula ignores:

1. negative weights
2. entropy weights (E)
3. normalization weights (N)
4. link weights

The expression ignores the relative position of terms with respect to other terms, term ordering, contextuality, and many other things that can be incorporated into the notions of relevancy, similarity and relatedness. The equation also ignores how other documents in the collection induce similarity to a given document, or whether such an effect is negative or positive in nature.

Regarding 1, negative weights can be accounted for with, for example, a probabilistic model. To mention just one:

w = tf*log((D - d)/d)

This expression is also considered part of the family of tf*IDFs. In fact, in the past some authors have referred to it as IDF = log((D - d)/d), too. The expression will change from positive to negative whenever d is greater than 50% of D. This means that valid terms can have negative weights. This is purely a mathematical effect, not a particular weighting mechanism or filtering from a search engine. However, this leads to interesting retrieval and scoring (ranking) complications.

Regarding 2, entropy weights can be incorporated by adding expressions of the form p*ln(p), where p is the term probability.

Regarding 3, normalization can be incorporated by normalizing the length of documents. This can be done by taking crude ratios or by using document "pivoted" normalization. All this is discussed in the ORNL paper.
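
A quick numeric illustration of the negative weights mentioned under 1, using made-up document counts and an arbitrary base-10 log (a sketch of the two expressions above, nothing more):

import math

def classic_weight(tf, D, d):
    return tf * math.log10(D / d)        # classic tf*IDF, never negative while d <= D

def probabilistic_weight(tf, D, d):
    return tf * math.log10((D - d) / d)  # flips sign once d exceeds 50% of D

D = 1_000_000
for d in (10_000, 400_000, 600_000):     # term appears in 1%, 40%, 60% of the documents
    print(d, round(classic_weight(2, D, d), 3), round(probabilistic_weight(2, D, d), 3))
# At d = 600,000 the probabilistic weight is negative even though the term is valid.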

To top it off, global weights can be made recursive and adaptable to a query or geolocation, so they are not just a mere constant. Relevance feedback can be incorporated in the background so the initial query can be expanded to discover and append new documents.

LSI can be triggered as an ancillary mechanism to do concept matching and expand answer sets. The final answer set can be the result of reranking and merging previous sets (fusion) or of purging. All of this can be transparent to end users sitting at the other end of a search box.

All this suggests that we need to think in terms of the co-retrieval power of words, rather than in terms of mere term-to-term matching. I hope to put out a piece on this soon and to show how, even in these cases, c-index calculations can be used to address co-retrieval.

Cheers
Dr. Garcia

PS. I have corrected one expression and a few typos.

Edited by orion, 08 October 2006 - 11:10 AM.


#35 marianne

marianne

    Ready To Fly Member

  • Members
  • 20 posts

Posted 11 October 2006 - 01:50 PM

Wowza, my Catholic school math is failing me on a lot of cognitive levels with your reply, Dr. Garcia. Could you rephrase it as if explaining this to a drama major? :)

Many thanks.

#36 Guest_orion_*

Guest_orion_*
  • Guests

Posted 15 October 2006 - 10:14 AM

I'll be happy to. Could you be a bit more specific?

Meanwhile I would say this: the idea of trying to apply the IDF concept introduced by Karen Sparck-Jones back in 1972 to an IR problem from 2006, and assuming that search engines are sticking to that model to score weights, is contraindicated.

The fact is that the tf*IDF scheme is just one of many term weight schemes, and the simplest one that incorporates local and global weights. While IDF is a keystone IR concept, tf*IDF is just one way to look at the weights of terms. There are many better schemes that incorporate IDF or a derivative of it, believe me. The primitive tf*IDF model is taught at CS schools as introductory material for more advanced concepts, since it has several limitations (some described in my previous post).

Regarding the origin of the IDF concept: it is derived from Zipf's Law. It was introduced for the very first time in 1972 by Sparck-Jones in the Journal of Documentation, in a paper called

A statistical interpretation of term specificity and its application in retrieval

More on this here. I highly recommend that members of this forum read the IDF Page, which is a tribute to Sparck-Jones' brilliant work.

Edited by orion, 15 October 2006 - 10:19 AM.



