> Yahoo! Superunits: of signatures and co-occurrence

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Apr 20 2005, 09:55 PM
A Yahoo! patent application was published on April 14th which describes a "concept index," created from associations about the web in a somewhat new manner.

While I haven't spent the time to do a lengthy analysis, I thought it might be nice to throw it out here, and include a couple of quotes from the document.

The patent application is at:

Systems and methods for search processing using superunits

The motivation behind this new method appears to be to create an index that attempts to provide answers to queries based upon the way that people search.

QUOTE
What human beings think in terms of are natural concepts. For example, "hawaii" and "new york city" are vastly different queries in terms of length as measured by number of words but for a human being they share one important characteristic: they are each made up of one concept. In contrast, a person regards the query "new york city law enforcement" as fundamentally different because it is made up of two distinct concepts: "new york city" and "law enforcement."

Human beings also think in terms of logical relationships between concepts. For example, "law enforcement" and "police" are related concepts since the police are an important agency of law enforcement; a user who types in one of these concepts may be interested in sites related to the other concept even if those sites do not contain the particular word or phrase the user happened to type. As a result of such thinking patterns, human beings by nature build queries by entering one or more natural concepts, not simply a variably long sequence of single words, and the query generally does not include all of the related concepts that the user might be aware of. Also, the user intent is not necessarily reflected in individual words of the query. For instance, "law enforcement" is one concept, while the separate words "law" and "enforcement" do not individually convey the same user intent as the words combined.


The abstract introduces some new vocabulary words - a concept network, a unit, a superunit, and signatures.

QUOTE
A concept network is generated from a set of queries by parsing the queries into units and defining various relationships between the units, e.g., based on patterns of units that appear together in queries. From the concept network, various similarities between different units can be detected, and units that have some identifying characteristic(s) in common may be grouped into superunits. For each superunit, there is a corresponding signature that defines the identifying characteristic(s) of the group. A query can be processed by identifying constituent units, determining the superunit membership of some or all of the constituent units, and using that information to formulate a response to the query


Each entry in the page index described in the patent application includes a word, a link to a page, and some type of context identifier, which may help to provide a sense of how the word is used upon the page.

This quote was also interesting:

QUOTE
To establish an association between units, a minimum frequency of co-occurrence may be required. It should be noted that the units that are related by association need not appear adjacent to each other in queries and that the string obtained by concatenating associated units need not be a unit.


This patent sounds like one to look into in some more depth.

<edit - fixed typo>

Star Member

Group: 1000 Post Club
Joined: 9-January 05
Posts: 1,532
From: Perth, Western Australia
post Apr 21 2005, 03:07 AM
Mate,

Good find. Do you work in a patents office ?

It's amazing how early in the development of an application a company will try to protect these ideas.

Moderator

Group: Moderators
Joined: 6-March 03
Posts: 7,962
From: Langley, British Columbia, Canada
post Apr 21 2005, 04:36 AM
Yes, Bill, that's extremely interesting and novel. Indeed it may be more useful than the Google central PageRank concept if I'm understanding it correctly.

Google attempts the task of cataloguing all web pages based only on the web pages. If there were no humans around Google could still provide SERP's. Yahoo! instead will use all those human searchers and their behaviour to provide patterns of meanings in web pages to determine relevance through meanings. Each one of us becomes an involuntary 'editor' in helping this Y! approach. It sounds a much more likely way to get relevant SERP's.

Travis, that's the dilemma in seeking patents. Do it too early and you may not grab all you need to grab and others may be able, from the patent which is public, to figure out how to grab what you've missed. Do it too late and the finding may inadvertently get out into the public domain and then you can no longer patent it. So earlier is usually better. After all, someone else out there may have the same thought at about the same time.

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Apr 21 2005, 08:20 AM
QUOTE
Do you work in a patents office ?


Thankfully, no. :)

But, within patent applications are opportunities to find information from companies who are interested in creating and protecting their legal rights, but might not be willing to share that information in other manners.

We can speculate and experiment all we want, but when a Google or a Yahoo! has a patent application published, it can be helpful to see what words they use to describe what they are doing, what they might be doing, and what they want to exclude others from doing.

QUOTE
Indeed it may be more useful than the Google central PageRank concept if I'm understanding it correctly.


The PageRank concept had, and probably still has, its uses. Looking at the way pages connect with each other does provide value, and considering different sites to have varying levels of trustworthiness is reasonable. But PageRank by itself has some flaws, and there do seem to be ways to build upon it.

The recent Google Agerank patent also looks at user involvement with web pages - which pages are selected from a list of search results, which pages do people spend more time upon after conducting a query, which topics seem to be more popular, and which seem to be less popular.

But we haven't seen any documentation from them like this Yahoo! patent, which talks about indexing the senses that words are used within on a site, the changes in context of words based upon their surroundings, the grouping of items in a page index into concepts, and understanding their relationship to similar concepts, and the grouping of those concepts into larger ones.

One alternative to a patent is a trade secret. And while that may work well for the secret formula for Coca-Cola, or Kentucky Fried Chicken, it might not work so well for search engines.

Moderator

Group: Moderators
Joined: 29-August 02
Posts: 5,751
From: Bristol, UK
post Apr 21 2005, 08:56 AM
So what they are talking about is a more advanced word association process then?

Trying to automatically create a hierarchical structure of linked topics, much like the directory structure of something like DMOZ perhaps?

If they are trying to build a structure of linked topics like that, I imagine directories could be quite useful to the process. As well as the idea of themes within a web site.

Star Member

Group: Members
Joined: 24-February 05
Posts: 517
post Apr 21 2005, 10:32 AM
This patent is discussing a LocalRank-like methodology combined with click analysis. They are analyzing user queries by click results, which was always a stupid idea (DirectHit tried it and failed) because users don't always hit the BACK button when they get to a bad site. I usually open results in a new browser and close the browser window. That is because I don't like waiting for my browser to re-render the query results page.

The Google patent application deals with more than just TimeRank-stuff. It's concerned with distinguishing between natural relevancy and artificial relevancy.

This Yahoo! patent application uses "relatedness" for what Google usually calls "relevance".

One similarity between Yahoo!'s process and Google's process is that they are now both looking at classes of queries (sets or collections of queries) and analyzing them for user preferences. By comparing a new query to past queries, they are both hoping to determine more quickly exactly what the users are looking for.

And this patent application also indicates that, like Google, Yahoo! is now treating links as data structures. However, the Google model is more sophisticated. It looks to me like you'll be able to manipulate Yahoo!'s results by associating two or more classes of links with targeted anchor text with a specific document.

That is, to rank for "New York City law enforcement", you want links pointing to the page which say "New York City", "Law Enforcement", and "New York City law enforcement" as well as "new york city police" and so forth.

They will look for documents which are determined by inbound links to be relevant to multiple topics. I suspect they may be struggling with the Natural Relevance versus Artificial Relevance issue that afflicts Google, MSN, and all the search engines.

i.e., this is another anti-optimization optimization by a search engine. It will be interesting to see what MSN does in this area.

Member

Group: Members
Joined: 4-August 04
Posts: 17
From: Tri-Cities, Wash.
post Apr 21 2005, 12:55 PM
QUOTE
A concept network is generated from a set of queries by parsing the queries into units and defining various relationships between the units, e.g., based on patterns of units that appear together in queries. From the concept network, various similarities between different units can be detected, and units that have some identifying characteristic(s) in common may be grouped into superunits. For each superunit, there is a corresponding signature that defines the identifying characteristic(s) of the group. A query can be processed by identifying constituent units, determining the superunit membership of some or all of the constituent units, and using that information to formulate a response to the query

I believe this is already in effect, though I have only one shred of evidence to support that.

Y's SERPs changed fairly dramatically in the recently announced update with regards to local real estate. I typically watch this sector and run a lot of "CITYNAME real estate" and "CITYNAME homes" searches.

CITYNAME and "real estate" would be labeled as two different concepts in this patent, like the "New York City law enforcement" example. My suspicion is that CITYNAME is a "superunit", as explained here:

QUOTE
A query can be processed by identifying constituent units, determining the superunit membership of some or all of the constituent units, and using that information to formulate a response to the query.

And I say that for this reason: as of the most recent Y update, my "CITYNAME real estate" searches are producing SERPs that are more heavily about the CITYNAME than about "real estate." (It frankly looks a lot like the G results post-November 2003 when directories and city guide pages dominated similar SERPs.)

All of this is a long way of saying that I think this is already in use. As soon as I read the snippets in the original post, I thought of these SERPs I've seen change lately.

Great find, bragadocchio.

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Apr 21 2005, 10:51 PM
Some excellent points.

It may be partially in use, Pleeker. There are lots of uses of the words "a variation of this..." in there, so we may have to try to figure out what may or may not be implemented if parts of it are.

A good quick analysis, Michael. I've spent half the night with the patent application now, and I've only made it through the first third or so. But, I can see how a co-occurrence analysis can possibly be a benefit under this patent. Those types of links you describe may help. Though ideally, links to and from pages which share similar concepts should contain associated words and phrases on the pages, and in links anyway.

I'm not quite ready to hold Google's method up against Yahoo!'s and say which is more sophisticated. All I can say clearly is that they are different enough so that what works on one doesn't necessarily have to work on the other. I do think that there may be more than just something like local rank in Yahoo!'s methodology. Hopefully we can probe into that further.

QUOTE(Adrian)
So what they are talking about is a more advanced word association process then?


Yep. It looks more closely at how words and concepts are related, and defines different types of associations that may be weighed differently. A hierarchical tree structure like you find in DMOZ does do a fine job of indicating relationships between pages. Some references are made in the patent application to a combination of using spiders and crawlers, having manual human interaction, and also defining relationships by looking closely at queries.

Yahoo! already has a directory structure that it owns, and can use to help in its efforts.

I've made a start of translating this application, and that starts in my next post.

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Apr 21 2005, 11:08 PM
I've tried to put parts of this patent application into a simpler language. Like most patents and patent applications, it's long. I've completed about the first third or so of the "Detailed Description of the Invention" and have skipped over some of the early parts of that section.

Systems and Methods for Search Processing Using Superunits

The Search engine returns search results, media content and content in response to links selected in search result pages. It also records user queries in its log files.

One version of the Search Engine uses indexes filled with:

1. pages,
2. links to pages,
3. data representing the content of indexed pages, and
4. more.

This page index information is collected from crawlers, spiders, and human created or guided directory systems. This index can be part of the search engine, or in a separate system.


Context Identifiers

An entry in that page index includes:

a. a search term,
b. a link to a page on which that term appears and
c. a context identifier for the page.

The context identifier acts to group similar results for search terms that may have different meanings in different contexts.

For example, the search term "java" may refer to:

a. the Java computer language,
b. the Indonesian island of Java, or
c. coffee.

The context identifier for a page indicates which context applies.

A page and a link to a page may have more than one context identifier, and may be displayed in more than one context.
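
To make the page index entry a bit more concrete, here is a minimal Python sketch of how I picture it. The class name, the field names, the example.com URLs and the context labels are all my own assumptions for illustration; the patent application doesn't specify a storage format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One hypothetical page index entry: term, page link, context identifier(s)."""
    term: str
    url: str
    contexts: tuple  # a page may carry more than one context identifier


# Made-up entries for the "java" example; none of these come from the application.
PAGE_INDEX = [
    IndexEntry("java", "https://example.com/java-language", ("programming",)),
    IndexEntry("java", "https://example.com/java-island", ("geography", "travel")),
    IndexEntry("java", "https://example.com/coffee", ("coffee",)),
]


def group_by_context(term, index):
    """Group the page links for a search term under each context identifier."""
    groups = defaultdict(list)
    for entry in index:
        if entry.term == term.lower():
            for context in entry.contexts:
                groups[context].append(entry.url)
    return dict(groups)


if __name__ == "__main__":
    for context, urls in group_by_context("Java", PAGE_INDEX).items():
        print(context, urls)
```

Note that the island page shows up under two contexts, which matches the point that a page may be displayed in more than one context.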

The preference is to have automatic association of context identifiers with page links created as users perform related searches; but those associations to links may also be created and changed by human index editors. This combination of automatic and manual association enables the system to define and re-define contexts and improve results.


I. Concept Analysis


The Search Engine provides answers to a query, ranked by algorithms for concept analysis, which can combine:

1. logical relevance, as measured by patterns of occurrence of the search terms in the query;
2. context identifiers;
3. page sponsorship;
4. others

A. Ambiguous Terms and Contexts


One version analyzes search queries or results or both, and displays results grouped in contexts. So, a search for "Java" may group results in a number of contexts or word senses (categories) that have been identified. For "java," that might be:

1. Java the computer language,
2. Java the island, and
3. coffee java.

1. linked sets for each category, or
2. links to understandably different categories that users can select to see the associated links for each.

1. pages from the search index,
2. links associated with sponsored matches,
3. links associated with directory matches, and
4. links associated with Inside Yahoo! (IY) matches.


B. Unambiguous Terms and Contexts


While the search engine will work with words or phrases that have ambiguous meanings such as "Java", some grouping may happen for terms that aren't necessarily ambiguous. An example may be the results for a search using the term "Hawaii". "Hawaii" by itself might not be ambiguous; but results for it could be very broad, including every site mentioning Hawaii.

The search engine might provide useful results by organizing them into categories (contexts). Results could be returned in various groupings such as:

1. "Hawaii: travel"
2. "Hawaii: climate"
3. "Hawaii: geography"
4. "Hawaii: culture"
5. others.

These context identifiers may be stored in the page index. Remember, an entry in the page index includes a search term, a link to a page on which that term appears, and one or more context identifiers for the page.


II. A Statement about the System


The patent application at this point asks us to keep in mind that the descriptions in the application are illustrative, and that a system developed from this application may be different than what it is describing. The system could be limited to a small area, or part of a widely distributed network. It could use more than one set of indexes and algorithms for providing results, and it could include information from other sources.


III. Concept Networks and Superunits


A. Recognizing Concepts


One version of this system uses algorithms to analyze concepts related to search terms to return relevant results.

For example, searching for "New York City," a user is probably interested in sites about New York City instead of a city in New York. And, in a search for "New York City law enforcement," the user probably wants sites about law enforcement in New York City.

Most search engines would search looking for sites that contained the individual words in that query, regardless of their order. Other search engines might look for the longest string of adjacent words from the search phrase which also appears in their index. So, if that index contained "New York", "New York City" and "New York City law" but not "New York City law enforcement", that search engine would use "New York City law" and "enforcement" in its search, probably not returning results that the searcher was looking for.
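
The "longest string of adjacent words" approach can be sketched in a few lines of Python. This is only my reading of the behaviour described, with a made-up function name and a tiny made-up index; it is not how any particular engine is known to work.

```python
def greedy_longest_match(query, index_phrases):
    """Split a query by repeatedly taking the longest leading run of words
    that appears in the index, falling back to a single word."""
    words = query.lower().split()
    pieces = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shrink one word at a time.
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if candidate in index_phrases or j == i + 1:
                pieces.append(candidate)
                i = j
                break
    return pieces


# Hypothetical index contents, mirroring the example in the text.
INDEX_PHRASES = {"new york", "new york city", "new york city law"}

if __name__ == "__main__":
    print(greedy_longest_match("New York City law enforcement", INDEX_PHRASES))
```

With that index, the query splits into "new york city law" and "enforcement", which is exactly the unhelpful split described.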

The system in this patent application would, when faced with "New York City law enforcement," recognize the concepts "New York City" and "law enforcement" and return results for these two concepts.



i. Using the Order of Terms to Recognize Concepts


The system could use the order of terms within a search query to identify the concepts that make it up. So, this system could hash together "New York City" and "law enforcement" as concepts in the query and return results for those concepts. And it's possible that the same results would be returned for "law enforcement in New York City."

But, for "city law enforcement in New York," the concepts "law enforcement" and "New York" and "city," or "city law enforcement" and "New York" might be used. And, "enforcement of law in New York City" could include the concepts "New York City," "law" and "enforcement."

The order of concepts isn't as important as the order of terms that make up a concept.


ii. Using Unit Dictionaries and Concept Discovery to Recognize Concepts


In one version of this system, concepts would be included in the page index, for instance, as terms and/or context identifiers. Or it could use a separate concept index. Or it could use both, so that "law enforcement" might be understood as the same as "enforcement of law" or not, depending on the context. A concept within a query could be detected by using a "unit dictionary" containing a list of known concepts (or "units").
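
Here is a rough Python sketch of detecting concepts with a unit dictionary. The scoring rule (prefer splits with the fewest words that aren't covered by a known unit, then the fewest pieces) is purely my assumption; the application doesn't say how competing splits are chosen, and the dictionary contents are made up.

```python
from functools import lru_cache

# Hypothetical unit dictionary; a real one would be built from query logs.
UNIT_DICTIONARY = {
    "new york", "new york city", "new york city law", "law enforcement", "city",
}


def segment_into_units(query, units):
    """Split a query into known units, keeping unknown single words as-is."""
    words = tuple(query.lower().split())

    @lru_cache(maxsize=None)
    def best(start):
        # Returns (unknown_pieces, total_pieces, pieces) for words[start:].
        if start == len(words):
            return (0, 0, ())
        options = []
        for end in range(start + 1, len(words) + 1):
            piece = " ".join(words[start:end])
            known = piece in units
            if not known and end > start + 1:
                continue  # multi-word pieces must be known units
            unknown, count, rest = best(end)
            options.append((unknown + (0 if known else 1), count + 1, (piece,) + rest))
        return min(options)

    return list(best(0)[2])


if __name__ == "__main__":
    print(segment_into_units("New York City law enforcement", UNIT_DICTIONARY))
    print(segment_into_units("law enforcement in New York City", UNIT_DICTIONARY))
```

Under these assumptions both queries come back with "new york city" and "law enforcement" as units, regardless of which one comes first.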


A unit dictionary could be created by using information from a large number of previous queries, preferably at least several hundred thousand. This type of creation is referred to as "concept discovery" and uses an analysis of those previous queries to build a concept network. It may be performed by the search service or by another server.


B. Concept Networks


A "concept network" refers to any set of relationships among concepts.

Each concept or unit (e.g., "New", "York", "New York City", etc.) is a "node" of the network and is connected to other nodes by "edges" representing relationships between concepts.

A concept network can define different types of relationships. Relationships include:

1. extensions ("ext"),
2. associations ("assoc"),
3. alternatives ("alt"); and,
4. other relationships could also be defined additionally, or in place of those.
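
One way to picture a concept network is as a small labelled graph. The Python sketch below is just an illustration; the class, its method names and the example weight are my own assumptions, not structures named in the application.

```python
from collections import defaultdict


class ConceptNetwork:
    """Units are nodes; each edge carries a relationship type and a weight."""

    RELATION_TYPES = {"ext", "assoc", "alt"}

    def __init__(self):
        # unit -> {(relation, other_unit): weight}
        self.edges = defaultdict(dict)

    def add_edge(self, source, target, relation, weight=1.0):
        if relation not in self.RELATION_TYPES:
            raise ValueError(f"unknown relationship type: {relation}")
        self.edges[source][(relation, target)] = weight

    def related(self, unit, relation):
        """Return {other_unit: weight} for one relationship type."""
        return {
            target: weight
            for (rel, target), weight in self.edges.get(unit, {}).items()
            if rel == relation
        }


if __name__ == "__main__":
    network = ConceptNetwork()
    network.add_edge("new york", "city", "ext")
    network.add_edge("new york", "hotels", "assoc", weight=0.4)  # made-up weight
    print(network.related("new york", "assoc"))
```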


i. Extensions as Relationships

An extension is a relationship between two concepts or units that exists when the string obtained by concatenating the two concepts or units is also a concept or unit.

Example: the string obtained by concatenating units "new york" and "city" is "new york city," which is also a unit.
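
A quick Python sketch of spotting extensions in a list of known units. I'm assuming "concatenating" means joining with a single space, and the unit list is made up.

```python
def find_extensions(units):
    """Return (first, second, combined) wherever joining two units gives another unit."""
    unit_set = set(units)
    extensions = []
    for first in unit_set:
        for second in unit_set:
            combined = f"{first} {second}"
            if combined in unit_set:
                extensions.append((first, second, combined))
    return sorted(extensions)


# Hypothetical units for illustration only.
UNITS = [
    "new", "york", "new york", "city", "new york city",
    "law", "enforcement", "law enforcement",
]

if __name__ == "__main__":
    for first, second, combined in find_extensions(UNITS):
        print(f"{first!r} + {second!r} -> {combined!r}")
```

The double loop is fine for a toy list; a dictionary with hundreds of thousands of units would need something smarter.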


ii. Associations as Relationships

An association is a relationship that exists between two concepts or units that appear in queries together.

Example: the word (unit) "hotels" can be associated with (the unit) "new york" and it can be associated with (the unit) "new york city". Pairs of associated units are referred to as "neighbors," and the "neighborhood" of a unit is the set of its neighbors.

An association between units may require a certain amount of co-occurrence. In other words, compare the number of queries in which the units appear together with the number in which they don't, and if they "co-occur" in enough queries, they are associated. Keep in mind that they do not have to be next to (adjacent to) each other to be related by association.

Also, if you place associated words or concepts next to each other in a string, they don't need to make up a new unit. But if they do, then an extension relationship would also exist. So, an extension relationship is really a special kind of association relationship. For instance, the words "dog" and "pound" are probably associated because they appear together in a large number of queries. And placing them next to each other also shows an extension relationship: "dog pound."
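
As a rough illustration of associations, here is a Python sketch that counts how often pairs of units show up in the same query and keeps the pairs that meet a minimum count. Using a plain count as the threshold is my simplification, and the queries (already parsed into units) are invented.

```python
from collections import Counter
from itertools import combinations


def find_associations(parsed_queries, min_cooccurrence=2):
    """Return {(unit_a, unit_b): count} for pairs that co-occur often enough.

    Units only need to be in the same query; they don't need to be adjacent.
    """
    pair_counts = Counter()
    for units in parsed_queries:
        for a, b in combinations(sorted(set(units)), 2):
            pair_counts[(a, b)] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_cooccurrence}


# Made-up queries, each already split into units.
QUERIES = [
    ["new york city", "hotels"],
    ["hotels", "cheap", "new york city"],
    ["new york city", "law enforcement"],
    ["dog", "pound"],
    ["dog", "pound", "hours"],
]

if __name__ == "__main__":
    print(find_associations(QUERIES))
```

In the second query "hotels" and "new york city" are separated by "cheap" and still count as co-occurring.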


iii. Alternatives as Relationships

An alternative relationship is when you have a word or concept and a different form of the same expression. These can be broken down into:

1. preferred,
2. corrected, or
3. a variation of that first unit.

1. "motel" and "hotel."
2. "brittany spears" and "britney spears" (different spellings), or
3. "belgian" and "belgium" (different parts of speech).

A "preferred" alternative unit would be the one that shows up more frequently. A correctly spelled alternative might be the preferred alternative unit. Words that differ only in capitalization aren't normally alternatives under this system, since it's case insensitive, but in other versions of the search engine case may matter, and units that differ only in capitalization may be considered alternatives.
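
Picking the "preferred" alternative by frequency could look something like this in Python. The query counts are invented for illustration, and treating a missing unit as zero occurrences is my assumption.

```python
def preferred_alternative(alternatives, query_counts):
    """Return the alternative that appears in the most queries."""
    return max(alternatives, key=lambda unit: query_counts.get(unit, 0))


# Made-up counts; not real query data.
QUERY_COUNTS = {"britney spears": 9000, "brittany spears": 1200}

if __name__ == "__main__":
    print(preferred_alternative(["brittany spears", "britney spears"], QUERY_COUNTS))
```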


iv. Showing the Strength of Relationships by Assigning Weights to Edges

In our concept network, each concept or unit is a "node" and is connected to other nodes by "edges" which represent the relationships between those concepts. The edges in the concept network can be assigned weights, which could be numerical values representing the relative strength of the different relationships.

Example of the use of weights for relationships:

The weight of an edge (relationship) between a first unit and an associated unit (keep in mind that this is just one of the different types of relationships) may be based on taking all of the searches which include the first unit, and looking at the fraction of those which also contain the associated unit. Or it could take all of the searches which contain either unit, and look at the percentage of those which contain both.
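
Both weighting options just described are simple ratios, so here is a small Python sketch computing them side by side. The function name and the sample queries are made up, and I'm assuming each query has already been parsed into a set of units.

```python
def association_weights(parsed_queries, first, other):
    """Return (share of queries with `first` that also have `other`,
    share of queries with either unit that have both)."""
    with_first = with_other = with_both = 0
    for units in parsed_queries:
        has_first = first in units
        has_other = other in units
        with_first += has_first
        with_other += has_other
        with_both += has_first and has_other
    with_either = with_first + with_other - with_both
    conditional = with_both / with_first if with_first else 0.0
    overlap = with_both / with_either if with_either else 0.0
    return conditional, overlap


# Invented queries for illustration.
SAMPLE_QUERIES = [
    {"new york", "hotels"},
    {"new york", "pizza"},
    {"hotels", "paris"},
    {"new york", "hotels", "cheap"},
]

if __name__ == "__main__":
    print(association_weights(SAMPLE_QUERIES, "new york", "hotels"))
```

The first number depends on which unit you start from; the second is symmetric between the two units.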


These types of weights can show how strong the relationship is. The weights may be normalized (some examples of normalization are here: http://www.datamodel.org/NormalizationRules.html ).


The example above may be what is used to show the weight of a relationship between words (units or concepts) that have an association relationship. The other types of relationships may be given weights in different ways. A "concept network" includes all of those different ways of assigning weights to different relationships.


Next: Supernodes...

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Apr 21 2005, 11:34 PM
The Yahoo! patent application uses the term co-occurrence a number of times. Orion, from SEW forums, started a great thread on the subject a while back, and it's one of the most visited threads on an SEO forum anywhere.

It's a topic worth looking more closely at, and if you haven't read the thread, it's worth spending some time there going through it.

See: Keywords Co-occurrence and Semantic Connectivity

You may want to start with the examples in this post:

http://forums.searchenginewatch.com/showpo...492&postcount=9

Star Member

Group: 1000 Post Club
Joined: 9-January 05
Posts: 1,532
From: Perth, Western Australia
post Apr 22 2005, 07:15 AM
Mate,

That's a fantastic post. I have a client at the moment tossing up whether to use one of

(a) Engineering Training
(b) Engineering Workshops
(c) Engineering Seminars
(d) Engineering Courses

Based on the semantic connectivity index presented, which one should we choose and why? I don't understand where he got those large numbers from.

Should we not just type these into the overture tool and choose the most popular one ?

And we just picked two Engineering Training web design contracts in Perth, so I have to promote both of them. Although they are not direct competitors (one is purely mechanical) I will probably apply a similar phrase to both from the above 4 choices.

The post by Orion is excellent. A few people had a crack at him about SEO not being a science, but that's totally wrong. Anything you can analyse and record data about is a science.

Good to see some people taking it more seriously and trying to present SEO in a more scientific context.

The only improvements we have enjoyed on Yahoo and Google is when we applied a more scientific approach to our designs in accordance to what we thought was important in each case.

These patent applications tend to be a nice hindsight look at what we have been trying to understand.

Star Member

Group: Members
Joined: 24-February 05
Posts: 517
post Apr 22 2005, 09:30 AM
QUOTE
(a) Engineering Training
(b) Engineering Workshops
(c) Engineering Seminars
(d) Engineering Courses

Based on the semantic connectivity index presented, which one should we choose and why. I dont understand where he got those large numbers from ?


Ignore the semantic connectivity index. Focus on what people are searching for. The search engine technologies these patent applications represent are still in their infancy.

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Apr 24 2005, 01:21 AM
I'm not sure that I can agree with you Michael.

There may be a value in looking at the co-occurrence of the terms regardless of whether or not this patent has been implemented. It can give a sense of how relevant the indexing system presently in place finds each of those terms.

There's also a value in gaining a sense of how the technologies described in the patent would work, and how to make a decision like the one that Travis is trying to make based upon that technology.

Sure the technology described in the patent application is still in its infancy, but so is the rest of the web. And the ideas in the application didn't spring out of someone's head fully formed. A fair amount of people have been giving them a fair amount of thought. So it makes sense to try to figure out what the patent application may mean, and how to use a co-occurrence analysis to see what Yahoo!'s indexing programs think of the relationships between those different phrases, regardless of the technology in place.

Looking closer at the patent, it's only one of at least three that Yahoo! has filed that cover these concepts. Here are the other two:

Systems and methods for generating concept units from search queries

QUOTE(Travis)
Based on the semantic connectivity index presented, which one should we choose and why.  

Should we not just type these into the overture tool and choose the most popular one ?


The overture tool will only tell you which of those terms might have been searched for the most over a short period of time. It can be useful in determining whether or not people are searching for those terms.


The formula that Orion gave was: c = n12/(n1 + n2 - n12)

About the formula.

The idea behind this is to simply get a sense of how many times these keywords appear in the index (roughly, in the same results) compared to how many times the keywords appear in the index in total.

So, n12 is the number of times the keywords appear together in results. n1 is the number of times that the first keyword appears in results. It is important that your n1 be the same for a comparison like this. n2 is the number of times the second keyword appears in results.

The reason why we subtract n12 from n1 + n2 in the formula is to not count those results where they appear together twice.

Let's test those phrases in Yahoo! using the formula to see which has the highest c-index



k1=engineering = 146,000,000
k2= training = 382,000,000
k12=engineering training = 36,200,000
c= 36,200,000/(146,000,000 + 382,000,000 - 36,200,000)
c= 36,200,000/491,800,000
c=0.0736 or 73.6 ppt

k1=engineering = 146,000,000
k2= workshops = 48,500,000
k12=engineering workshops = 4,900,000
c= 4,900,000/(146,000,000 + 48,500,000 - 4,900,000)
c= 4,900,000/189,600,000
c=0.0258 or 25.8 ppt

k1=engineering = 146,000,000
k2= seminars = 41,100,000
k12=engineering seminars = 5,000,000
c= 5,000,000/(146,000,000 + 41,100,000 - 5,000,000)
c= 5,000,000/182,100,000
c=0.0275 or 27.5 ppt

k1=Engineering = 146,000,000
k2= courses = 119,000,000
k12= engineering courses = 17,200,000
c= 17,200,000/(146,000,000 + 119,000,000 - 17,200,000)
c= 17,200,000/247,800,000
c=0.0694 or 69.4 ppt


Our results (in parts per thousand - to make it easy to compare the numbers):

engineering training - 73.6 ppt
engineering courses - 69.4 ppt
engineering seminars - 27.5 ppt
engineering workshops - 25.8 ppt
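
For anyone who wants to rerun this kind of comparison, here is a short Python sketch of the c-index formula, c = n12/(n1 + n2 - n12), applied to the result counts listed in this post. The function and variable names are my own, and the counts are just the figures quoted here, so they will differ on another engine or another day.

```python
def c_index(n1, n2, n12):
    """Co-occurrence index: results with both keywords divided by
    results with either keyword."""
    return n12 / (n1 + n2 - n12)


# Result counts quoted in this post: n1 for "engineering", then (n2, n12) per word.
ENGINEERING = 146_000_000
COUNTS = {
    "training": (382_000_000, 36_200_000),
    "workshops": (48_500_000, 4_900_000),
    "seminars": (41_100_000, 5_000_000),
    "courses": (119_000_000, 17_200_000),
}


def rank_by_c_index(n1, counts):
    """Return [(word, c in parts per thousand)] sorted from highest to lowest."""
    scores = {
        word: 1000 * c_index(n1, n2, n12) for word, (n2, n12) in counts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for word, ppt in rank_by_c_index(ENGINEERING, COUNTS):
        print(f"engineering {word} - {ppt:.1f} ppt")
```

Keeping n1 fixed and only swapping the second keyword is what makes the scores comparable.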

So, the words engineering and training appear together more frequently in documents in Yahoo!'s index than engineering and courses, and engineering and seminars, and then engineering and workshops.

Does this mean that more people might search for "engineering training" than "engineering workshops?" Don't know. But we do know that there is a greater percentage of documents in Yahoo!'s index where engineering and training appear within the same document than engineering and workshops.

It is important that one of the keywords is the same here. We use "engineering" in all of these. If we were using completely different sets of keywords, the comparison wouldn't be worth anything. What this comparison tells us is that, among the candidate second words, "training" shows up alongside "engineering" at a higher rate than any of the others.

Now, that's looking at what Yahoo! has for these phrases. Keep in mind that Google will have something different. And MSN and Ask Jeeves.

Star Member

Group: 1000 Post Club
Joined: 9-January 05
Posts: 1,532
From: Perth, Western Australia
post Apr 24 2005, 02:11 AM
Mate,

That's gold.

Who do I make the cheque out to - B.Slawski or Broggodacchio ?

If only you had a dollar for every post you ever made (You would have over 10,000 dollars!)

But seriously, the algorithms of search engines are devised in such a way to make the mathematical calculation at the last stage very simple. For example, by using vector algebra. Half the problem for search engine companies is getting the issue at hand and its data into a vector or matrix format, where the world of linear algebra opens up some really nice tools.

The number of calculations per unit time can then be revved to very high levels giving the search engines their intrinsic magic speed.

So this model would fit nicely into a large calculation scheme. Thanks for the advice Bill. I missed the part on where they got the numbers from. I have already started implementing this idea and you should see the results (of both engineering sites) in the website hospital in the coming weeks. And then three months after that in the search engine section.

Moderator

Group: Moderators
Joined: 6-March 03
Posts: 7,962
From: Langley, British Columbia, Canada
post Apr 24 2005, 05:39 AM
This is very intriguing, but I see an additional complexity here. OK, the c-index shows us how often the two words will turn up somewhere within the same web page. So Bill put forward the following values for the c-index.

engineering training - 73.6 ppt
engineering courses - 69.4 ppt
engineering seminars - 27.5 ppt
engineering workshops - 25.8 ppt

You can ask a different question. How often does the two word phrase appear in web pages? This of course is only a subgroup of all the web pages where the two words appear either as the two word phrase or as separate words somewhere in the web page.

Doing a search in Yahoo for the two word phrases you get the following numbers of web pages. I've listed the four phrases in Bill's order of the c index.

"engineering training" - 180,000 web pages
"engineering courses" - 275,000 web pages
"engineering seminars" - 7,080 web pages
"engineering workshops" - 22,600 web pages

So I'm left with the question - which is the most useful way of looking at these results in thinking about the words to use and what are the SEO implications?

Star Member

Group: 1000 Post Club
Joined: 9-January 05
Posts: 1,532
From: Perth, Western Australia
post Apr 24 2005, 06:07 AM
Good point Bwelford,

We have actually started development with "Engineering Training Courses" as the main target phrase for the client.

That should keep everybody happy. I don't know whether the spacing of the words has any negative effect on the ranking.

Appreciate the extra analysis.

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Apr 24 2005, 09:51 PM
QUOTE(Travis)
But seriously, the algorithms of search engines are devised in such a way to make the mathematical calculation at the last stage very simple.


That's an excellent point, Travis. These things can get pretty complex, and involve a great number of calculations. I'm looking forward to hearing about your results.

QUOTE(bwelford)
How often does the two word phrase appear in web pages?


Important question, Barry. The patent does look at a number of different relationships. The co-occurrence concept we've looked at is one that might fit best under the "association" relationship described in the patent application.

And, looking at the number of pages where the phrase itself appears can be meaningful, too. I believe that Orion describes this in his discussion of co-occurrence at SEW.

We also want to consider these phrases under an analysis that involves an "extension" relationship. As I wrote above:

QUOTE
Also if you place those words or concepts next to each other in a string of associated units, they don't need to make up a new unit. But if they do, then an extension relationship would also exist. So, an extension relationship is really a special kind of an association relationship. For instance, the words "dog" and "Pound" are probably associated because they may just appear in a large number of queries together. And, placing them next to each other also shows an extension relationship - "dog pound."


How might you determine whether or not a phrase has some semantic connection? Possibly by looking at the ratio of results where the exact phrase appears (an exact search using quotation marks) to results where both words appear (a findall result, which is what many search engines return when you enter a phrase without quotation marks: the search engine looks for keyword1 and keyword2 and keyword3, etc.).

Some of the math, and some of the potential pitfalls of this approach are more fully covered here:

Overlapping Patterns: EF-Ratios, Separators, Patterns and Pitfalls
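
As a rough illustration of that ratio, here is a short Python sketch using the exact-phrase counts Barry posted and the findall counts from the c-index post. The function name exact_to_findall_ratio is my own label, and dividing exact by findall is my reading of the idea, not a formula taken from the patent application or from the linked article.

```python
def exact_to_findall_ratio(exact, findall):
    """Share of pages containing both words where they appear as the exact phrase."""
    return exact / findall


# phrase -> (exact-phrase result count, findall result count), both from this thread
counts = {
    "engineering training": (180_000, 36_200_000),
    "engineering courses": (275_000, 17_200_000),
    "engineering seminars": (7_080, 5_000_000),
    "engineering workshops": (22_600, 4_900_000),
}

ratios = {
    phrase: exact_to_findall_ratio(exact, findall)
    for phrase, (exact, findall) in counts.items()
}

# Print from highest to lowest ratio, in parts per thousand.
for phrase, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{phrase}: {1000 * ratio:.2f} ppt")
```

On these numbers the ordering differs from the c-index ordering, which is the complexity Barry was pointing at.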

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Apr 24 2005, 10:09 PM
In the post above where I attempted to interpret parts of the patent application, I left off without writing about superunits. It's an important part of the application, so I wanted to make sure that I covered it in some depth before moving on to the other parts of the application.


Superunits


One relationship in the concept network is the membership of units in "superunits."

A superunit is a set of units with an identified common characteristic. That common characteristic may include multiple elements, and uses a "signature" to determine whether another unit belongs in the superunit.

The degree of similarity between a unit's characteristics and the signature's characteristics could be used as a membership weight with a certain threshold in that weight needed for a unit to be considered a member of a superunit.


Example:

One superunit could be made up of cities:

New York City,
San Francisco,
Chicago,
others

Its signature could include other units that frequently appear in searches along with the name of a city, such as:

hotel,
museum,
mayor,
jobs,
etc.

To see if a new unit (possibly another city) is a member of the superunit of cities, that new unit's associations are compared to the associations in the signature.
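
Here is a minimal Python sketch of that comparison. It assumes the membership weight is simply the fraction of signature units that also show up among the new unit's associations, and it uses a made-up threshold of 0.5. The patent application does not spell out this particular measure, and the association sets for "boston" and "banana" are invented for illustration.

```python
# Signature for the "cities" superunit, taken from the example above.
CITY_SIGNATURE = {"hotel", "museum", "mayor", "jobs"}


def membership_weight(associations, signature):
    """Fraction of signature units found among the candidate's associations."""
    return len(associations & signature) / len(signature)


def is_member(associations, signature, threshold=0.5):
    """True when the membership weight reaches the (hypothetical) threshold."""
    return membership_weight(associations, signature) >= threshold


# Invented association sets for two candidate units.
boston = {"hotel", "museum", "jobs", "marathon"}
banana = {"recipe", "bread", "calories"}

print(membership_weight(boston, CITY_SIGNATURE), is_member(boston, CITY_SIGNATURE))
print(membership_weight(banana, CITY_SIGNATURE), is_member(banana, CITY_SIGNATURE))
```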


Another Example

A superunit may be made up of units that are alternatives for each other:

britney spears,
brittany spears,
britney speers,
etc.

The signature for that superunit might include units associated with the singer's name:

photos,
mp3,
tour,
others

A parameter (element) using an "edit distance" indicating similarity in spelling might also be used. A unit that has similar associations but a large edit distance (such as "barbra streisand" or "celine dion") wouldn't be included, while other misspellings of Britney Spears would be. This is covered in more depth later in the patent application.
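
To show what an edit distance check might look like, here is a small Python sketch using the standard Levenshtein distance (the number of single-character insertions, deletions and substitutions needed to turn one string into another). The patent application does not say which edit distance it has in mind, so Levenshtein is my assumption, and the cutoff of 3 is a made-up value for illustration.

```python
def levenshtein(a, b):
    """Levenshtein edit distance between strings a and b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def is_spelling_variant(candidate, reference, max_distance=3):
    """True when the candidate is within the (hypothetical) cutoff of the reference."""
    return levenshtein(candidate, reference) <= max_distance


for name in ["brittany spears", "britney speers", "barbra streisand", "celine dion"]:
    print(name, levenshtein(name, "britney spears"), is_spelling_variant(name, "britney spears"))
```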

As with other relationships between units, the unit dictionary is where superunit signatures, superunit membership information, and membership weights for the various units are stored.

Not every element in a signature for a superunit will have the same weight, and weights will be assigned to different elements to try to achieve the most relevant results.
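
One way to picture weighted signature elements is the Python sketch below. The weights are invented, and scoring a unit by the share of total signature weight that it matches is my own assumption about how such weights could be combined, not something the patent application states.

```python
# Hypothetical weights for the elements of the "cities" signature.
SIGNATURE_WEIGHTS = {"hotel": 3.0, "museum": 1.0, "mayor": 2.0, "jobs": 1.0}


def weighted_membership(associations, weights):
    """Matched signature weight divided by total signature weight."""
    total = sum(weights.values())
    matched = sum(weight for unit, weight in weights.items() if unit in associations)
    return matched / total


# Two candidates that each match two signature elements, but of different weight.
print(weighted_membership({"hotel", "mayor"}, SIGNATURE_WEIGHTS))
print(weighted_membership({"museum", "jobs"}, SIGNATURE_WEIGHTS))
```

Both candidates match half of the elements, but the first scores higher because it matches the more heavily weighted ones.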

The search engine uses superunit information in response to searches by determining which superunits the units in a search belong to, and comparing those units to the signatures of these superunits to determine the likely intent of the user.

The search engine also can use this information about likely user intent to organize results and suggested related searches.


The patent application starts getting a little more detailed from here. I've started going through it, but it will likely take some time to go through the many examples and illustrations.

Untested

Group: Members
Joined: 2-May 05
Posts: 2
post May 2 2005, 03:04 PM
New patent from Yahoo: Strange, just a few days after Google decides on releasing a patent, Yahoo decides to do the same.... mmmh... sounds more like viral marketing to me than anything else. What is fantasy, and what are actual features currently being developed by Yahoo? Who knows?

But I am still waiting for results proving that speculation on latent semantic analysis is used by google for instance.

=> co-occurrence.

I have not read anywhere (yet?) that semantic analysis was used by Google, or other search engines. Co-occurrence factors are used in this context only. So...why should I bother?

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post May 2 2005, 10:49 PM
Hi Chinese,

Welcome to the forum.

Only the people at Yahoo! really know for sure what they are using, just as only the people at Google know for sure what they are using.

The Google patent application was made public on March 31st, and the Yahoo! patent application was made public on April 14th. So yes, only a couple of weeks separate the dates on which these patent applications could have been noticed by the public.

Using a patent application as a part of viral marketing would be a potentially expensive and potentially embarrassing way of advertising. I do think that it is important that we ask what value each document has. I also think that it's important that we take a look at the methods described, ask if it's worth trying to deconstruct them, and see if the concepts included have value in learning and understanding.

I'd say that regardless of whether or not Google or Yahoo! are using those technologies, it's still worth learning about the methods described, and ignoring them is something that people interested in understanding the search engines should do at their own risk.

A patent doesn't have to be an indication of what a search engine is actually using, though it is great to get something to look at that is written by the people who work for the search engines. It's at least as worthy of study as all of the speculation that circulates around the web. The ideas and concepts behind each patent application are well thought out, and show possible methods for indexing the web using a search engine. Not only that, but if those patent applications become actual patents, they enable each company to exclude others from using the technology described.

The ideas presented in the applications cover a wide gamut of notions about how a search engine could work, from Google's look at how historical data could be used to determine credibility and authenticity and relevance, to Yahoo!'s methodology for contemplating how words entered in searchers' queries may exhibit some type of relationship between them, and how exploring those queries and the relationships between them can be used to augment other methods of indexing, including using a human edited directory, and might make search results more relevant.

QUOTE
But I am still waiting for results proving that speculation on latent semantic analysis is used by google for instance.


You know, of course, that much of the speculation tied to whether Google is using some sort of latent semantic analysis was fueled by their acquisition of the company Applied Semantics a couple of years back, and chances are good that some of that technology has been used to develop their contextual advertising. Has it been included in the indexing and presentation of search results? We can't be sure. But you don't stand a chance of knowing one way or the other unless you actually research how latent semantic indexing works.

I'm not sure whom it is that you would expect to issue proof that Google is using that type of indexing in their search results. Google wouldn't be the ones to come out and say that they are or aren't. There's no real benefit for the search engines to come out directly, and explain exactly what they are doing. But there is a benefit to them in applying for a patent to protect their intellectual property.

Failure to apply for a patent in a timely manner can keep an inventor from ever getting a patent on the material. Such a failure would also keep them from excluding others from using that material. There is a benefit to applying for a patent other than just some press, or a handful of discussions in a very few forums.

There's also a benefit to learning about and understanding co-occurrence. I'm not sure if you spend much time with the search engines, performing searches, and experimenting with them, just to see what types of results you get back from the attempts. Or if you care much about how words are distributed around the web, and how the different search engines will index those words differently.

Co-occurrence can be used as a tool to help understand how a search engine may be working. You really don't need to believe it is something that a search engine is or isn't using at this point to get a benefit from understanding it, and seeing how it can be applied to understand how a search engine indexes words.

Regardless of what a search engine does, it does make sense that when you have a choice of phrases which share at least one word, and you can see that the words which make up one of the phrases tend to appear together with a much higher frequency than the words in the other phrases, there is probably some type of meaningful connection between those words.

Of course, you don't have to bother.

You don't need to read these patents, and try to understand them. You don't need to try to make sense of latent semantic indexing or co-occurrence. When the search engines file for patents, you could just believe that it's a publicity stunt, and ignore it. That's certainly your prerogative.

There is a lot of misinformation and disinformation on the web. Believe what you want at your own peril. For instance, there are lots of documents on the web that explain how to set up meta tags on your pages for best effect. I've seen hundreds that recommend using a "revisit after" tag, even though the only search engines to really use it include a third tier one that you may not have heard of and the inventor of the tag, a regional one in British Columbia, and they've given up on using it.

When a search engine sends out information about potential ways to index the material on the web, it really doesn't matter if it is something that they are presently using, something that they may use, or if it is something that they will never use. Understanding that information, and possessing the ability to see how it could be used, and may be used, has a fair amount of value. It allows you to make informed and educated guesses in the absence of insider knowledge. Sure, view it with a rational amount of skepticism. But ignoring it as an advertising stunt without taking the effort to understand it is something that I'm not ready to do.