
Cre8asiteforums - Internet Marketing and Conversion Web Design



How is C-Index Used?



#1 AbleReach

    Peacekeeper Administrator

  • Site Administrators
  • 6467 posts

Posted 10 September 2005 - 01:15 PM

C-Index (co-occurrence index) is the likelihood that terms will appear together. For instance, if I say "Opera," many of you would think "browser," or Norton --> antivirus, and that fact is reflected in the natural word choices of web site authors. This helps SEs create logical content clusters, which (as I understand it) are related to how search engines rate relevance.

I would be interested in a discussion of how C-Index analysis is used in SERPs, especially as related to niche terms.

How might an especially high or low C-Index affect a site's SERPs, time in the sandbox or AdSense targeting?

I am assuming that a good combination for targeted advertising would be keywords denoting a logical content area with a high C-Index, in conjunction with a low-ish number of search results. Would that be any different for SERP-related SEO? Some AdSense pages are spammy, yet spam is not a long-term solution for SEO.

Elizabeth, the self-educated ;-)

#2 bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 23 September 2005 - 01:46 AM

Excellent questions, Elizabeth.

We discussed this topic a little in a few threads around here, but it does deserve some additional coverage.

Imagine that you are doing some keyword research, and you want to know which of the following terms might be the best ones to use:

engineering training
engineering workshops
engineering seminars
engineering classes

How might you go about deciding upon which of those to use? They are all pretty close in meaning. They all might be equally useful in page titles, and in headlines, and in links to pages.

You could check search volumes through a tool like Wordtracker. Imagine that you do, and the traffic appears to be similar for each of the terms. What else could you do to see if one might be a better choice than another?

Could you use a c-index? If you do, it could tell you which of those pairs of words tend to appear together in a document more frequently than the others, within the universe of terms that might appear on the web (or at least within a specific search engine).

Dr. Garcia has an excellent set of articles on how to calculate a c-index, and some potential pitfalls to the process at:

Keywords Co-Occurrence and Semantic Connectivity
http://www.miislita..../c-index-1.html


Keyword 1 and keyword 2 make keyword phrase 12 (K1, K2, K12)
They all return different sets of documents (N1, N2, N12)

With the limitations explained in Dr. Garcia's article, this would be how to calculate the C-Index (in parts per thousand):

(N12/(N1 + N2 - N12))* 1000

engineering training
(162,000,000/(781,000,000 + 1,340,000,000 - 162,000,000))*1000 = 82.7 ppt

engineering workshops
(33,900,000/(781,000,000 + 257,000,000 - 33,900,000))*1000 = 33.8 ppt

engineering seminars
(34,500,000/(781,000,000 + 227,000,000 - 34,500,000))*1000 = 35.4 ppt

engineering classes
(52,400,000/(781,000,000 + 366,000,000 - 52,400,000))*1000 = 47.9 ppt
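If you would rather script this than punch a calculator, here is a minimal Python sketch of the same calculation (the counts are just the ones quoted above; in practice you would plug in fresh result counts from the engine):

def c_index(n1, n2, n12):
    # Co-occurrence index in parts per thousand:
    # n12 divided by the size of the union of the two answer sets.
    return 1000.0 * n12 / (n1 + n2 - n12)

n1 = 781_000_000  # results for "engineering" alone
pairs = {
    "training": (1_340_000_000, 162_000_000),  # (n2, n12)
    "workshops": (257_000_000, 33_900_000),
    "seminars": (227_000_000, 34_500_000),
    "classes": (366_000_000, 52_400_000),
}
for term, (n2, n12) in pairs.items():
    print(f"engineering {term}: {c_index(n1, n2, n12):.1f} ppt")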

From those calculations, we see that in Google, engineering and training are cited as appearing in the same document more often than any of the others, and then engineering and classes, with engineering and seminars next, and then engineering and workshops.

This may mean that those words (engineering and training) are more semantically connected than the other sets of words, in Google's universe of documents.

So, armed with this knowledge, where would you go from there?

#3 AbleReach

    Peacekeeper Administrator

  • Site Administrators
  • 6467 posts

Posted 23 September 2005 - 10:15 AM

"Classes" could refer to several things - not as targeted as more specific wording.
Seminars and workshops are types of training.

The confirmation of "training" would lead me to the next specifications, especially since this is a big area.

-- What kind of engineering? Electrical? Structural? This might not have as many keywords to look at if the data is vertical, correct? There'd be the technical designation and also a common name, similar to the difference between a CPA and an accountant or bookkeeper.

-- Is there a more specific way of saying what kind of training, or a reason to specify? Cram session? Retreat? School?

-- The focus of the training would be a significant qualifier. Lateral forces? Building materials? Biological research?

My guess would be that though the most specific words would have less competition, the general terms (in-site directory pages?) would be more likely to get bookmarked - is this a good guess? Specific terms would get better AdSense targeting. And I'd be concerned that keyword research may put the cart before the horse at times - what if people don't use the same terms that are in the web sites?

For example, "moves" is a keyword that is highly related to karate, but karate kata is what karate moves are actually called. People who already do karate would look for kata. New blood would look for moves. Their interests and needs would be different.

Karate kata would be easier to reach because of less competition. Karate moves would have greater traffic potential because of an audience that includes both those in the know and the curious.

Would the niche term have a different impact on sandbox time? If so, is there a way to tell how niche you can go before the sandbox says "whazzup?"

Gotta go!

E

#4 bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 23 September 2005 - 10:50 AM

I'll try a different example, but without doing all that math.

Say I wanted to write three pages about blues music, and I wanted to focus on specific instruments on each of those pages. I could use a number of different instruments:

Drums
Guitar
Trumpet
Saxophone
Harmonica
Banjo

I might want to try to see which of those instruments are mentioned on the same pages as Blues, within the universe of all the pages on the web (as indexed by Google) that talk about the blues and those specific instruments.

I could create a C-Index to try to get an idea. It might give me an idea that most documents indexed in Google about the Blues and specific instruments tend to focus on Guitar, Saxophone, and Harmonica the most (a guess), and almost never mention the Banjo, giving me a very low C-Index result for Blues and Banjo. Doing a little research to follow that up, I might find that the Banjo was rarely used in Blues music.

The basic idea behind a c-index is to let you know how frequently words might be cited together in a body of documents as opposed to how frequently they are cited apart.

Can this help you with AdSense, or with a sandbox? I don't know if it really can. But it can give you a glimpse into the relationships between words, and might help you make some choices, like my blues instruments example above.

#5 Ruud

    Hall of Fame

  • Hall Of Fame
  • 4887 posts

Posted 23 September 2005 - 12:29 PM

Thinking out loud...

Then going from that, you would want to target both the high and low C-index. The high one suggests these "naturally" occur together. But the low one could indicate an easy trail.

Could C-index research be used then to identify unique content?

#6 AbleReach

    Peacekeeper Administrator

  • Site Administrators
  • 6467 posts

Posted 23 September 2005 - 12:57 PM

The basic idea behind a c-index is to let you know how frequently words might be cited together in a body of documents as opposed to how frequently they are cited apart.

Hmmm... having more to do with human keyword insight than an SE's internal decisions? A content cluster needs to make sense to the target audience.

I've been working under the theory that keyword insight (as benefits human searchers) should come before SEO decisions made for an SE, because the mysteries of algo balancing will shift from update to update.

AND, I would like to gradually understand more of how an SE attempts to gain insight into the target audience's thought process, balancing that with judging SEO tactics, in the end creating quality search results. The point for both SEs and humans is quality results, eh? Sooo, I want to know how algos "think," too.

I don't know how much of the math I will understand. I can factor quadrinomials and in a pinch I can graph a parabola, but I haven't made the connection to how to use that background for understanding how SEs balance conceptual issues.

I do know that this kind of info made more sense this year than it did last year. :-) :-) Gleeful thanks for happening to suggest a resource that I didn't get the last time!
Keywords Co-Occurrence and Semantic Connectivity
http://www.miislita..../c-index-1.html

And if a recovering math phobe like me can get it, well, here's a vote for the learning process of anyone else.

Elizabeth

#7 AbleReach

    Peacekeeper Administrator

  • Site Administrators
  • 6467 posts

Posted 23 September 2005 - 01:09 PM

Then going from that, you would want to target both the high and low C-index. The high one suggests these "naturally" occur together. But the low one could indicate an easy trail.

Could C-index research be used then to identify unique content?

Exactly. I am wondering if there's a way to tell how far to push it besides keyword gathering, guessing and waiting for logs to analyze. Go too wide and there's not going to be SE traffic unless the site is high ranking to start with. Go too narrow and there'd better be a significant number of niche pages, to draw in significant traffic from buckets of little niche searches.

I jumped from "how might I quantify this?" straight to "hmmm, C-index."

Am I barking up the right tree?

Elizabeth

#8 randfish

    Hall of Fame

  • Members
  • 937 posts

Posted 23 September 2005 - 01:55 PM

This is a kind of very simplistic way of thinking about C-Indices - http://www.seomoz.or...tail.php?ID=304

They can also be used for other fun things, like realizing what other pages would be best to get links from, how best to "mix up" anchor text and even what other terms to have on your pages.

Think of it as a way to measure how "connected" the SEs think two terms are. The more connected, the better they go together - like eggs and bacon or pasta and parmesan.

#9 Ruud

    Hall of Fame

  • Hall Of Fame
  • 4887 posts

Posted 23 September 2005 - 11:10 PM

My big goals at SEOmoz for the future center around building two tools - one that can calculate how "on-topic" a particular document is, and another that can, given a document, determine its primary subject matter and map that to a large, hierarchical ontology of concepts.


Yes, the latter tool was on my mind when I was thinking about this thread tonight. The reverse application is actually more interesting, fascinating even. Have you ever looked at or tried Tropes and Zoom from Semantic Knowledge?

Back to search: a set of searches, as provided by something like Wordtracker, could then have its own c-index as well...

#10 AbleReach

    Peacekeeper Administrator

  • Site Administrators
  • 6467 posts

Posted 24 September 2005 - 08:46 PM

Checking out the free version of Tropes Zoom is now on my to-do-later list. Thanks for the link, Ruud.

Comparing human knowledge of terminology with C-index is fascinating. Do SEs also use categorical indexes made by humans? Encyclopedia topics? Secret spellbooks, barrels of monkeys, cans of worms? ;-)

Elizabeth

#11 Ruud

    Hall of Fame

  • Hall Of Fame
  • 4887 posts

Posted 24 September 2005 - 08:57 PM

Comparing human knowledge of terminology with C-index is fascinating.


*nods* huh-huh. But as you can see from Amazon's Statistically Improbable Phrases, you can do a lot more with it.

Like... spot spam-nonsense auto-generated pages, perhaps?

#12 Guest_orion_*

  • Guests

Posted 28 September 2005 - 08:09 PM

I'm happy many are interested in the metric. Many have misunderstood or tried to dilute the concept, not to mention the "c-index tables and calculators" out there. Here is one example of how not to use c-indices: http://bombdogstudio...t-it-means.html

If interested, in the original SEWF thread I mentioned how the metric can be used for email and link analysis.

Contrary to some opinions, c-indices or co-occurrence do not tell which terms are more important. To elucidate that, one would need to conduct an on-topic analysis that would include co-occurrence and clustering techniques. [No need to resort to LSI, which is a number-crunching approach with the dreaded dimensionality reduction curse.]

Sometimes I feel many don't get this co-occurrence thingy, which is why I offered in the past to put together seminars for some of you on c-indices, EF-ratios and semantics. The offer is still on the table.

The c-index is a conditional probability. Since the co-occurrence phenomenon can be global, local or fractal, the conditions differ in each case. Also there is the question of co-volume (search volume co-occurrence).

Co-volume is treated differently. Since it cannot be measured with current keyword research services (Wordtracker, Overture, etc.), what you get from those services can be garbage in, garbage out.

At SES San Jose I met Andy Mindel (Wordtracker) and one of his programmers and explained this to them, and why their tool fails in this respect. Mike Grehan was present. In an aside I also showed Mike an in-depth reasoning written on a napkin. As of today, they still have no answer to the problem I presented.

Back to the definition. Essentially, when I invented the metric I defined it more or less as follows:

A c-index is the probability that k1, k2,....kn would co-occur provided that k1, k2...kn have occurred.

In addition, a c-index is a normalized co-occurrence. The co-occurrence phenomenon can take place in a corpus or database, in which case we talk about global co-occurrence. This appears to be the one discussed by the fine posters above.

When using the metric, one needs to consider the scope of the terms. Are they broader, narrower or specific? Also one would need to consider the ontology. A c-index or co-occurrence data for Noun-Noun cannot always be compared with data of the form Noun-Verb, Noun-Adjective, etc., or derivatives of these.

The c-index metric can be used in many settings. One such setting is when we want to use the time dimension, i.e., to measure temporal trends. In some cases it performs well. In other cases one can do better with the EF-Ratio metric. I am currently conducting experiments on temporal co-occurrence which show a clear scenario in which EF-Ratios perform better.

#13 AbleReach

    Peacekeeper Administrator

  • Site Administrators
  • 6467 posts

Posted 28 September 2005 - 09:53 PM

Sometimes I feel many don't get this co-occurrence thingy, which is why I offered in the past to put together seminars for some of you on c-indices, EF-ratios and semantics.

Ahhh! Hello! :wave:
After a quick scan of previous posts I now know where I first saw the "Dr. Garcia" link. :-) How time flies under the influence of intensive Cre8asite browsing!

Sometimes I feel many don't get this co-occurrence thingy

Is there sometimes a hesitance to get into the nitty gritty, in expectation of blank looks? Perhaps before being able to understand more, one would need to absorb the vocabulary with which to ask Really Basic Questions? Would you be so kind as to nominate some terms?

Contrary to some opinions, c-indices or co-occurrence do not tell which terms are more important. To elucidate that, one would need to conduct an on-topic analysis that would include co-occurrence and clustering techniques. [No need to resort to LSI, which is a number-crunching approach with the dreaded dimensionality reduction curse.]

Nods, with face in the form of a question mark. ;-) What kinds of clustering techniques? What level of math is needed to understand this kind of number crunching?

Good to see you back.

Elizabeth

#14 Ruud

    Hall of Fame

  • Hall Of Fame
  • 4887 posts

Posted 28 September 2005 - 11:06 PM

Hello Dr. Garcia. Thank you for entering the thread. It's outstanding for everyone participating, and everyone reading, to have The Source present here.

I'd love to ask you some questions I, and others I think, have.

When using the metric, one needs to consider the scope of the terms.


Here I wonder if, when applied to looking at a database such as Google, the metric remains valid, or indeed is applied correctly, if instead of noun-noun or noun-verb I would try to look at sentence-sentence or sentence-noun.

Co-volume is treated differently. Since it cannot be measured with current keyword research services (Wordtracker, Overture, etc.), what you get from those services can be garbage in, garbage out.


But, if I understand you right, if applied to a body of actual searches it can still show me that a search for k1 and k2 has a kn occurrence, right?

Since the co-occurrence phenomenon can be global, local or fractal, the conditions differ in each case.


The metric seems so clean that I have the impression it can be applied to anything? I like randfish's example with cereal boxes.

In addition, a c-index is a normalized co-occurrence. The co-occurrence phenomenon can take place in a corpus or database, in which case we talk about global co-occurrence.


What changes when we move from global to local, for example? Doesn't the calculation remain the same?

In other cases one can do better with the EF-Ratio metric.


Could you give one or more examples?

And to end the question session; what, in your opinion, is the most common pitfall or misconception around the c-index?

#15 Guest_orion_*

  • Guests

Posted 30 September 2005 - 02:52 AM

Hi, there.

Sorry I didn't respond before. Too many questions with too little time. I'll try to provide succinct answers without getting into IP protected material.

Would you be so kind as to nominate some terms?


c-indices = The probability that k1, k2, ... kn would co-occur provided that k1, k2,...kn have occurred.

EF-Ratio = Given a query Q = k1 k2 k3...kn consisting of n terms, where each k is a single term, the probability that a search for Q in FINDALL mode would return documents with the EXACT sequence Q = k1 k2 k3...kn is its EF-Ratio.

FINDALL = Also known as AND. The system must return documents containing all query terms. Terms can appear anywhere in the document without regard for order and proximity.

EXACT = Unlike FINDALL, this is a search where order and proximity do matter. Contrary to popular opinion, this is not a search for phrases. It can return documents with k1 followed by k2 but separated by a given separator (space, delimiter or stopword).

Thus, EXACT is a subset of FINDALL, but EXACT itself is a composite of sub-subsets. These sub-subsets are defined according to the nature of the separators.

Only when the separator is a space can we talk about what we perceive as a phrase.

A c-index calculated in FINDALL and a c-index calculated in EXACT are different and hint at different information.

More definitions and concepts are defined here
http://www.miislita....g-patterns.html

Note that we compute co-occurrence from a given database. So a c12-index computed in Google will be different from one computed from another search engine, since the answer sets n1, n2, n12 are different in each engine. This allows one to inspect and target a specific engine. It would not make sense to combine sets and co-occurrence data from dissimilar databases with different parsing rules for delimiters, stopwords and the like.


What kinds of clustering techniques? What level of math is needed to understand this kind of number crunching?


I use standard and experimental techniques. Standards: dendrograms, k-means, vector analysis, similarity distances, spanning trees, digraphs. Experimental: can't disclose. Chapter 5 of Graphical Exploratory Data Analysis (du Toit, Steyn and Stumpf; Springer-Verlag) is an oldie but a good start.

Here I wonder if, when applied to looking at a database such as Google, the metric remains valid, or indeed is applied correctly, if instead of noun-noun or noun-verb I would try to look at sentence-sentence or sentence-noun.  


For global co-occurrence, we measure the frequency of the queried terms and the collection of co-occurring documents as they appear in a database. To do as you propose, we would need to instruct the system (e.g. Google) to recognize a very long text stream as a k or a combination of k's. This transforms the problem into one of c-indices combined with searches in EXACT mode.

If you are describing a case of using sentence co-occurrence in individual documents, or sentence-noun in a given document then we are talking about local co-occurrence, not global.

Prof. Bruce Croft did this using Local Context Analysis (LCA) a long time ago.

Regarding the ontology: in his famous paper on LCA, Croft, Chair of the CS department at U. Massachusetts, demonstrated that noun groups are more flexible for conducting query expansion and convey more precise information than other combinations of n-grams. There is an entire theory that ties (at least in English) sentence production rules and L-systems to language learning and semantics. References here:
http://www.miislita....timization.html


But, if I understand you right, if applied to a body of actual searches it can still show me that a search for k1 and k2 has a kn occurrence, right?


Here we need to look at the meaning of the numbers and what we are trying to measure or estimate. A c-index estimates the phenomenon of terms co-occurring in a given database.

The problem with Wordtracker is that it gives you search counts that are a composite (let's say of n1, n2, n12) from many different databases. Using their results would produce a c-index, true, but from which database? Thus, computing a c-index would not mean much in this case.

There is also another problem (at least the last time I was at SES, I mentioned this to Wordtracker in a private meeting, and later during a dinner to Dr. Berkhin, senior director of data mining at Yahoo): Given the search volume count from these toys (Wordtracker or Overture), how would the user know which fraction of the search volume count was due to querying in FINDALL, and which fraction was submitted in EXACT? This is extremely important when we do keyword research for the purpose of data mining, constructing lexical trees or targeting specific sequences of terms.

This is another reason why I don't believe in keyword research results/services that rely on these toys. There is a post I made at SEWF on this subject which sheds more light on this.

True, we can compute c-indices from search volume, but it needs to be from a given database that allows you to discern between FINDALL and EXACT to properly compute a co-volume measure. For two terms, this is not that much of a problem. The real problem is when we deal with 3 or more terms, in which case co-volumes are hard to quantify.

The metric seems so clean that I have the impression it can be applied to anything? I like randfish's example with cereal boxes.  


Rand is a great guy and a personal friend of mine. Sometimes he tries, in good faith, to oversimplify to help others understand new things. I very much appreciate his effort. Unfortunately, sometimes during the simplification process the exact and intended meanings of the IR concepts (c-indices, term vector theory, LSI, temporal link analysis, etc.) have been diluted, incorrectly presented or wrongly interpreted. We are all human and we all make mistakes. I have my own good load of errors.

What changes when we move from global to local, for example? Doesn't the calculation remain the same?  


Yes. However, in local, we compute c-indices and EF-Ratios using passages, where

n1, n2 and n12 are number of author-defined passages containing k1, k2, and k12, accordingly.

This is an area in which c-indices and EF-Ratios help a lot for optimizing individual documents. Finding the optimum passage length can be tricky and is a real art. I initially defined a passage as a sentence, where

n1, n2 and n12 = number of sentences containing k1, k2 and k12, respectively.

The problem with this definition is that some sentences are longer than others. Thus, you need to transform this into a problem of sequential simplex optimization in which the length of the passage is optimized first. There are some proprietary methods for doing this that I prefer not to talk about. Still, defining sentences as passages is far better than resorting to myths (e.g., keyword density).
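To make the local variant concrete, here is a rough Python sketch using sentences as passages (a naive splitter and plain substring matching for illustration only; the passage-length optimization mentioned above is proprietary and not reproduced):

import re

def local_c_index(text, k1, k2):
    # n1, n2, n12 = number of sentences (the passages) containing
    # k1, k2, and both terms, respectively.
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    n1 = sum(k1 in s for s in sentences)
    n2 = sum(k2 in s for s in sentences)
    n12 = sum(k1 in s and k2 in s for s in sentences)
    union = n1 + n2 - n12
    return n12 / union if union else 0.0

doc = ("Blues guitar has a long history. The harmonica often answers "
       "the guitar. Few blues bands feature a banjo.")
print(local_c_index(doc, "guitar", "harmonica"))  # 1 shared sentence of 2 -> 0.5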

Could you give one or more examples?  


UF... That's part of an ongoing experiment I'm rushing to complete. I'll be happy to show examples/results after completion.

And to end the question session; what, in your opinion, is the most common pitfall or misconception around the c-index?  


It is a tool not a silver bullet.

To really understand the meaning of the numbers we are crunching, we need to look at the ontology involved and the semantics of the terms and combinations of terms.

#16 AbleReach

    Peacekeeper Administrator

  • Site Administrators
  • 6467 posts

Posted 01 October 2005 - 09:30 AM

OK, let's see if I get the tip of the iceberg. I'll re-state, in hopes of reassuring myself. ;-)

EF-Ratio = Given a query Q = k1 k2 k3...kn consisting of n terms, where each k is a single term, the probability that a search for Q in FINDALL mode would return documents with the EXACT sequence Q = k1 k2 k3...kn is its EF-Ratio.

FINDALL = Also known as AND. The system must return documents containing all query terms. Terms can appear anywhere in the document without regard for order and proximity.

EXACT = Unlike FINDALL, this is a search where order and proximity do matter. Contrary to popular opinion, this is not a search for phrases. It can return documents with k1 followed by k2 but separated by a given separator (space, delimiter or stopword).

Does EF = EXACT:FINDALL?
This would be a ratio between pages where the terms are found in a specific proximity, and pages where they are found anywhere on a page.

Can EXACT be either "deep blue sea" or deep, blue, sea?
FINDALL would then be any occurrence of deep, blue, and sea, alone or in any order, including "deep blue sea."

This could indicate the presence of both a phrase and discussion of that phrase, correct?

If the phrase "deep blue sea" is found in a page once and the words deep, blue, sea are found a total of three times in the same page, how would that be written in an EF ratio?

If "deep blue sea" is found once, but the other three words total deep found twice, blue once, sea three times, how would that be written as an EF ratio?


how would the user know which fraction of the search volume count was due to querying in FINDALL, and which fraction was submitted in EXACT? This is extremely important when we do keyword research for the purpose of data mining, constructing lexical trees or targeting specific sequences of terms.

Yes. If I understand correctly, this vagueness has always bothered me about search.

When searching for terms like - php case - after the first few, Google gives you "see results for php switch." It'd be more user friendly to offer a little list of contextually related search terms at the top of results, alongside an alternative spelling if indicated.


I use standard and experimental techniques. Standards: dendrograms, k-means, vector analysis, similarity distances, spanning trees, digraphs. Experimental: can't disclose. Chapter 5 of Graphical Exploratory Data Analysis (du Toit, Steyn and Stumpf; Springer-Verlag) is an oldie but is a good start

Hokay. :-) :oops:
Here, in order to follow what you offer with such passion, I'd need a background in statistical terminology. Do you believe that a solid foundation in using this kind of math is needed in order to accurately use the concepts behind keyword insight? Or could someone without that kind of math use the concepts while inputting terms into a pre-determined algo? It seems to me that translating the math into concepts that are readily graspable for others would fill workshops, sell SEO services and add to respect for what SEO actually does. Otherwise, wow, there's them blank looks.

For some people, all I have to do is say the words "Linux or Windows Server" and they are lost. For others, I can say things like "to help determine search engine results, spiders count terms and how important they seem to be, and add that to dozens of other factors." Then I get to explain that to speed searches (lol) during a search, the SE is relying on what spiders have ALREADY gathered - that part is not hard to digest, though some people will still glaze over if I insert the word "database" in the explanation.

SE related concepts ring my bell when I can use them to help motivate others. If someone who is just starting out can find their business (or their recent article) by inputting the name of the article, they're on the phone networking, telling others to search for - title of article. Keep them motivated and there will be more grist to work with. Give them a sense of the depth of insight needed, and they are more patient with the time needed to come up with real solutions.

Thanks to Dr Garcia for joining the discussion.

Elizabeth

#17 Guest_orion_*

  • Guests

Posted 03 October 2005 - 03:41 PM

Hi, there.

Does EF = EXACT:FINDALL?
This would be a ratio between pages where the terms are found in a specific proximity, and pages where they are found anywhere on a page.

Can EXACT be either "deep blue sea" or deep, blue, sea?
FINDALL would then be any occurrence of deep, blue, and sea, alone or in any order, including "deep blue sea."

This could indicate the presence of both a phrase and discussion of that phrase, correct?  

If the phrase "deep blue sea" is found in a page once and the words deep, blue, sea are found a total of three times in the same page, how would that be written in an EF ratio?  

If "deep blue sea" is found once, but the other three words total deep found twice, blue once, sea three times, how would that be written as an EF ratio?



Yes.


EF-Ratio = Results in EXACT Mode for Q/Results in FINDALL for Q

where Q is the query sequence k1 + k2 + .....kn.

This working definition and the technical one are given in the overlapping patterns link mentioned in the previous post.



Thus, in Google I would get the following for k123 = deep blue sea:

Results in EXACT Mode = 1,930,000
Results in FINDALL Mode = 15,800,000
EF-Ratio = 1,930,000/15,800,000 = 0.12 or 12%

This tells me that in Google, 12 out of 100 documents containing the terms deep, blue and sea in no particular order are targeting the exact sequence deep blue sea. (Note that I use the expression "sequence" and not "phrase"; reasons are given in the previous post.)


Now compare this sequence with a different sequence but same terms, say blue sea deep, again in Google:

EF-Ratio = 13,300/14,500,000 = 0.0009 or 0.09%

Compared with the previous one, this is a less targeted sequence for the same terms occurring in no particular order in Google. Thus, the EF-Ratio allows me to discriminate between sequences, to spot popular sequences in a given search engine, and to identify natural language sequences.
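In code, the working definition is just a quotient of two result counts. A small Python sketch with the Google counts quoted above (both counts must come from the same engine, one EXACT query and one FINDALL query):

def ef_ratio(exact_count, findall_count):
    # Fraction of the FINDALL answer set that also contains
    # the EXACT sequence.
    return exact_count / findall_count

print(f"deep blue sea: {ef_ratio(1_930_000, 15_800_000):.2%}")  # ~12%
print(f"blue sea deep: {ef_ratio(13_300, 14_500_000):.2%}")     # ~0.09%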


The EF-Ratio can also be used to discriminate between proper grammatical sequences. Indeed, one of the motivations for developing the EF-Ratio metric was that I initially wanted a way of using Google to identify and learn about proper grammar sequences in English. To some extent, the EF-Ratio can help you sort out the most popular, naturally occurring or proper sequences in a given database. This can also be used to identify sequences in a language that is not your first language.


For instance, you can use EF-Ratios to identify sequences of terms in Spanish. If we want to optimize a document in Spanish but we don't know Spanish, this is one way of identifying candidate combinations of terms and sequences. Keep in mind that an EF-Ratio is a tool, not a replacement for human editors. You may need editorial work to validate those sequences. So I would use the tool to leverage some of the work.


In addition, if I want to appeal to the larger Latino community I would need to use what we call "neutral" Spanish and stay away from country-specific regionalisms, which can affect the meaning of the concepts. If I want to flirt with a girl in Mexico, I can shout "que cuero", meaning that she is pretty. The same expression in Puerto Rico means that she is a prostitute. So, I still need to know the exact meaning of term sequences in a given country.


BTW, I keep a collection of SEO-translated materials that sound hilarious to the wider international audience from Latin America because of the regionalisms or Hispanic barbarisms injected. Many term sequences from, let's say, Mexico, are not used or do not exist in many Spanish-speaking countries. I guess a similar analogy applies to the way English sequences are used in the UK, USA, South Africa or Australia. Thus, an EF-Ratio for proper word usage is just another tool.


Do you believe that a solid foundation in using this kind of math is needed in order to accurately use the concepts behind keyword insight?



No. SEOs don't need to use too much math to understand these concepts. They need basic visual training without all the math mumbo-jumbo. Still, they need to know how to add, subtract, take ratios, and compute percentages. If they know how to use Excel, even better. If not, they can be trained on the fly in a "do this, do that" crash course seminar.


Of course, before crunching numbers, the SEO would need to know about ontological relationships rather than blindly entering data like a monkey into spreadsheets. Ontologies of the type N-N, N-V, V-N, etc. are all different, and the EF-Ratios may have different meanings. Even between N-N sequences we have to be very careful, as a term can be used as both a noun and a verb (examples: "budget", "training") or as a broader, narrower or specific term.


It seems to me that translating the math into concepts that are readily graspable for others would fill workshops, sell SEO services and add to respect for what SEO actually does.



I agree. Having SEOs more educated and trained will certainly help them in the long run in the industry and within other professions and circles (academia, gov grant proposals, professional writers, etc).


For some people, all I have to do is say the words "Linux or Windows Server" and they are lost. For others, I can say things like "to help determine search engine results, spiders count terms and how important they seem to be, and add that to dozens of other factors." Then I get to explain that to speed searches (lol) during a search, the SE is relying on what spiders have ALREADY gathered - that part is not hard to digest, though some people will still glaze over if I insert the word "database" in the explanation.


I like to use the following analogy to explain EF-Ratios. Let's say I have 3 boxes, each with 4 compartments and containing apples and oranges; thus n12 = 3. Now let's say I interconnect with a stick one apple and one orange in the first compartment of the first box. Let's say I do the same in the second compartment of the first box.


How many boxes contain apples and oranges? 3. How many boxes have apples and oranges connected by a stick? 1, thus "n12" = 1. The EF-Ratio is 1/3, or 33%, which is a global ratio for the boxes.


Let's ask a different question now. In box #1, how many compartments do I have? 4. How many compartments contain apples interconnected to oranges? 2. The local EF-Ratio is 2/4 or 50%.


In this case, the boxes are documents and the compartments can be passages from the documents.


We can even redefine the local EF-Ratio to the desired granularity, or whatever one's heart needs, by defining passages as text windows of a given length, as sentences, as every other sentence, as paragraphs, etc., in order to conduct more specific readability studies.
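The apples-and-oranges analogy translates almost line for line into code. A toy Python sketch (the numbers come straight from the analogy above and are illustrative only):

# One inner list per box; each boolean is a compartment, True where an
# apple is connected by a stick to an orange. All three boxes contain
# both fruits somewhere.
boxes = [
    [True, True, False, False],    # box 1: 2 of 4 compartments connected
    [False, False, False, False],  # box 2
    [False, False, False, False],  # box 3
]

# Global EF-Ratio: boxes with a connection / boxes with both fruits
print(sum(any(box) for box in boxes) / len(boxes))  # 1/3, about 33%

# Local EF-Ratio for box 1: connected compartments / all compartments
print(sum(boxes[0]) / len(boxes[0]))                # 2/4 = 50%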


A nice tutorial is available at
http://www.miislita....s-tutorial.html


It is an eye opener.

#18 notsleepy

    New To Community

  • Members
  • 2 posts

Posted 04 October 2005 - 02:30 PM

Hi Dr. Garcia,
Thank you for a very enlightening and helpful explanation of C-indexes. I hope to have a chance to buy you a drink at the WMW conference in Vegas.

In your tutorial you give examples of global and local EF-Ratios. In the example of "discount hotels" as well as the example for "hotels discount", your FINDALL result was 54,900,000. It seems that you used the FINDALL results from the search for discount hotels in both examples, rather than getting a different FINDALL result for hotels discount.

I'm curious if this was just a mistake, or if you didn't realize that Google does produce different results for searches based on the ordering of keywords. In other words, this statement about the search engine's default mode isn't correct:

Searching in FINDALL mode is a lot easier since in most search engines this is the default mode, also known as AND. In this mode, the system returns documents containing all query terms in no particular order.


Given this, is it now useless to attempt to discover how natural or unnatural a sequence is by using EF-Ratios on indexed page counts in a search engine like Google?

#19 Guest_orion_*

  • Guests

Posted 04 October 2005 - 04:12 PM

Hi, there.

No, that was not an error. It was done intentionally.

If you read the SEWF thread on Keywords Co-Occurrence or the series on c-indices at Mi Islita, this type of fluctuation is explained very well: search results in FINDALL often do not produce identical total numbers of results as expected, even though it is a search without regard for order or proximity. Here we look at relative results. There is a margin of relative error, which in most cases can be neglected in FINDALL results. This margin of relative error is greater in EXACT. The error itself affects all current keyword search tools and services out there.

There are several reasons for this source of error:

a. search engines are constantly purging/upgrading results which may occur at the time one is querying the engine.
b. hitting different layers of the multitiered database.
c. searching at different times during the day can also produce different results.

Even with the margin of error, comparisons can be made to discriminate between combinations of terms, especially if we conduct time series analysis using EF-Ratios. We are finishing a project showing how this is done.

Orion

#20 notsleepy

    New To Community

  • Members
  • 2 posts

Posted 04 October 2005 - 04:54 PM

search results in FINDALL often do not produce identical total numbers of results as expected, even though it is a search without regard for order or proximity


Forgive me if I misunderstand you but are you saying that Google's default search is FINDALL and the results are not affected by order of words in a search (excluding the sources of error you pointed out)?

I would disagree with that. Order and proximity are factors in Google search results.

#21 Guest_orion_*

  • Guests

Posted 04 October 2005 - 05:07 PM

I agree with that. I'm not speaking in absolute terms. "Often" is not "always" or "must".

FINDALL is the default mode, which in theory should return results regardless of ordering. This does not mean that results cannot be affected by ordering. I mentioned this and have even given several examples here:

http://www.miislita....ndex-3-old.html


Cheers

#22 Guest_orion_*

  • Guests

Posted 04 October 2005 - 05:20 PM

And here, when I explained that c-indices are not Jaccard coefficients as some may think (though for two terms and under very special circumstances they may be similar), I mention the importance of transposition in searches:

http://www.miislita..../c-index-2.html


This explains why we cannot rely on Jaccard coefficients for searches.

Cheers

#23 Guest_orion_*

  • Guests

Posted 07 October 2005 - 02:26 AM

There are two ways of elucidating candidate sequences. One method consists of conducting a temporal co-occurrence analysis (TCA) to monitor several sequences from a pool of terms over time. This approach works well, especially when we deal with long-term correlations over time, as is the case with seasonal trends and sudden search activity triggered by external events.

Another method I use consists of making a matrix of EF-Ratios covering all possible combinations for the ratios. I use this method when I need to determine the current text popularity state of a sequence.

If I search in EXACT mode for "discount hotels", documents containing the sequence may also be contained in the discount hotels or hotels discount FINDALL sets or both. I'm going to call these SET 1 and SET 2. Similarly, if I search in EXACT mode for "hotels discount" documents containing this sequence can be part of the discount hotels or hotels discount FINDALL sets or both. For a query consisting of two terms, I can inspect the likelihood of a combination by computing four different EF-Ratios.

The following figure shows recent search results in Google (10/06/05), where the four EF-Ratios were computed.

[Figure: a table of the four EF-Ratios computed in Google (10/06/05), crossing the EXACT counts for "discount hotels" and "hotels discount" with the two FINDALL sets.]

According to these results, "discount hotels" seems to be the more targeted sequence. This was also true on 02/15/05. Back then, the relative deviation (deviation/mean) between the two FINDALL sets was very small and negligible, even considering that Google returned far more results, about 25,000,000 to 30,000,000 more. Where are those documents? I don't know. Ask Google. A seasonal reason, a filter, or database upgrades/purging could account for these.
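For two terms, the matrix method amounts to crossing the two EXACT counts with the two FINDALL counts. A Python sketch of that computation (the counts below are hypothetical placeholders, since the figures from the original image are not reproduced here):

from itertools import product

def ef_matrix(exact_counts, findall_counts):
    # One EF-Ratio for every (EXACT sequence, FINDALL set) combination.
    return {(e, f): exact_counts[e] / findall_counts[f]
            for e, f in product(exact_counts, findall_counts)}

# Hypothetical counts for illustration only
exact = {'"discount hotels"': 9_000_000, '"hotels discount"': 600_000}
findall = {"discount hotels": 55_000_000, "hotels discount": 60_000_000}

for (e, f), r in ef_matrix(exact, findall).items():
    print(f"{e} in FINDALL set {f}: {r:.4f}")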


Cheers

#24 AbleReach

    Peacekeeper Administrator

  • Site Administrators
  • 6467 posts

Posted 07 October 2005 - 04:04 PM

Hello!

Sorry to take so long to contribute. I am in the midst of a move and the computers are still in boxes along with my sleep deprived mind. ;-)

Thank you Dr. Garcia and others. What a treat it is to make my way to a computer lab and find brain food.

One method consists in conducting a temporal co-occurrence analysis (TCA) to monitor over time several sequences from a pool of terms. This approach works well, especially when we deal with long-term correlations over time as is the case of seasonal trends and sudden search results triggered by external events.


Ahhh, ok. (Making note to look up TCA at later date.)
What are some good ways to use c-index when starting keyword research?

I usually ask how someone's customers would describe the web site owner's desired offerings, which leads to one set of phrases, then how peers would describe the same. Between the two lists I get a few general interest phrases, a few focused phrases, and a slew of possibly connected words. I make a list of all phrases as given, plus any that may cross over, then check to see what actually comes up on a SE. Tools like what's on SEOmoz help to speed the process.

One keyword example from someone I am currently working with is karate moves versus karate kata, or karate student versus karate-ka, or karate school versus karate dojo. These pairs mean basically the same thing, but show a different level of familiarity.

They are also all adjective+noun without commas, something I hadn't thought through before. Do noun, noun searchers think differently than phrase-users? When searching for a topic on my own I usually use commas: (most specific word), (less specific word), (common topical phrase). When guessing how others might search I use a phrase, usually a noun with qualifiers - e.g. early greek mythology.

I am assuming that c-index could help to sort out which terms are most likely to be used by browsers versus buyers. Is this true? Every site would need a mix - eventually you'd like browsers to convert to devotees or buyers.

Thanks to all, again,

Elizabeth

#25 Guest_orion_*

  • Guests

Posted 07 October 2005 - 05:11 PM

Hi, Elizabeth

I know. I hate moving, too.

I am assuming that c-index could help to sort out which terms are most likely to be used by browsers versus buyers. Is this true? Every site would need a mix - eventually you'd like browsers to convert to devotees or buyers.


Absolutely. C-indices, like EF-Ratios, provide a way of comparing relative results rather than absolute counts. Indeed, the use of absolute counts can be misleading.

To illustrate: if we look at the previous table, hotels discount returned more results (about 5,000,000 more) than the more natural sequence discount hotels. When we compute the EF-Ratio, we can see which sequence is the more targeted one: discount hotels.

So, the idea of using absolute search counts either from search results or search volume is questionable.




Orion

#26 Guest_orion_*

  • Guests

Posted 13 October 2005 - 07:30 PM

I usually ask how someone's customers would describe the web site owner's desired offerings, which leads to one set of phrases, then how peers would describe the same. Between the two lists I get a few general interest phrases, a few focused phrases, and a slew of possibly connected words. I make a list of all phrases as given, plus any that may cross over, then check to see what actually comes up on a SE. Tools like what's on SEOmoz help to speed the process.  



The simplest way of formulating c-index and ef-ratios is by thinking in terms of signal-to-noise ratios.


Both metrics involve dealing with word groups. In the c-index this involves unstructured sequences, while with EF-Ratios we deal with structured sequences. The goal is the same in both cases: noise reduction from a pool of results. Venn diagrams are a visual representation of such reduction.


Depending on the nature of the problem (e.g., temporal data triggered by specific events, a group sequence representing a brand, structural data, unstructured information, etc) we need to make a decision on which of the two metrics (c-indices or ef-ratios or both) we should use.


For multi-topic disambiguation or patent searches I would probably use c-indices. For a sequence associated with a trade brand, a trademark or a trade service consisting of common terms, I would probably use EF-Ratios. For temporal data, my choice depends on the nature of the system.

#27 Guest_orion_*

  • Guests

Posted 19 December 2005 - 03:36 AM

In a recent article, Co-Occurrence and the Scope of Terms, I discuss new advances in c-index calculations, in particular the emergence of apparent outliers and how these could be handled.

There are many reasons why co-occurrence values between two terms can be either low or high. One reason is the lack or presence of semantic associations between the terms.

Another reason that affects co-occurrence is that external factors, such as natural disasters, world events, editorial guidelines, the launch of a new product or service, or seasonal trends, can influence the number of documents indexed in a database (for blogosphere studies, add "memes" and other dynamic factors).

A third reason for obtaining low or high c12-index values is the nature of the terms involved: whether these are nouns, verbs, articles, etc., or whether we are dealing with term combinations of the form noun-noun, noun-verb, verb-noun, synonyms, homographs (same spelling, different meaning), etc.

Yet another reason is the scope of the terms, which is precisely what the article focuses on. I wanted to single out this topic, since the scope of terms in co-occurrence analysis and word association studies is often overlooked.

Let me know what you think so we could elaborate on it here.

Happy Holidays, all.

Edited by orion, 19 December 2005 - 03:41 AM.



