Moderator Alumni Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
Apr 20 2005, 09:55 PM
A Yahoo! patent application was published on April 14th which describes a "concept index," created from associations about the web in a somewhat new manner.
While I haven't spent the time to do a lengthy analysis, I thought it might be nice to throw it out here and include a couple of quotes from the document. The patent application is at: Systems and methods for search processing using superunits

The motivation behind this new method appears to be to create an index that attempts to provide answers to queries based upon the way that people search.

QUOTE
What human beings think in terms of are natural concepts. For example, "hawaii" and "new york city" are vastly different queries in terms of length as measured by number of words but for a human being they share one important characteristic: they are each made up of one concept. In contrast, a person regards the query "new york city law enforcement" as fundamentally different because it is made up of two distinct concepts: "new york city" and "law enforcement."

Human beings also think in terms of logical relationships between concepts. For example, "law enforcement" and "police" are related concepts since the police are an important agency of law enforcement; a user who types in one of these concepts may be interested in sites related to the other concept even if those sites do not contain the particular word or phrase the user happened to type. As a result of such thinking patterns, human beings by nature build queries by entering one or more natural concepts, not simply a variably long sequence of single words, and the query generally does not include all of the related concepts that the user might be aware of. Also, the user intent is not necessarily reflected in individual words of the query. For instance, "law enforcement" is one concept, while the separate words "law" and "enforcement" do not individually convey the same user intent as the words combined.

The abstract introduces some new vocabulary words - a concept network, a unit, a superunit, and signatures.

QUOTE
A concept network is generated from a set of queries by parsing the queries into units and defining various relationships between the units, e.g., based on patterns of units that appear together in queries. From the concept network, various similarities between different units can be detected, and units that have some identifying characteristic(s) in common may be grouped into superunits. For each superunit, there is a corresponding signature that defines the identifying characteristic(s) of the group. A query can be processed by identifying constituent units, determining the superunit membership of some or all of the constituent units, and using that information to formulate a response to the query.

Each entry in the page index described in the patent application includes a word, a link to a page, and some type of context identifier, which may help to provide a sense of how the word is used upon the page.
This quote was also interesting:

QUOTE
To establish an association between units, a minimum frequency of co-occurrence may be required. It should be noted that the units that are related by association need not appear adjacent to each other in queries and that the string obtained by concatenating associated units need not be a unit.

This patent sounds like one to look into in some more depth.
Apr 21 2005, 11:08 PM
I've tried to put parts of this patent application into simpler language. Like most patents and patent applications, it's long. I've completed about the first third of the "Detailed Description of the Invention" and have skipped over some of the early parts of that section.
Systems and Methods for Search Processing Using Superunits

The search engine returns search results, media content and content in response to links selected in search result pages. It also records user queries in its log files.

One version of the search engine uses indexes filled with:
1. pages,
2. links to pages,
3. data representing the content of indexed pages, and
4. more.

This page index information is collected from crawlers, spiders, and human created or guided directory systems. This index can be part of the search engine, or in a separate system.

Context Identifiers

An entry in that page index includes:
a. a search term,
b. a link to a page on which that term appears, and
c. a context identifier for the page.

The context identifier acts to group similar results for search terms that may have different meanings in different contexts. For example, the search term "java" may refer to:
a. the Java computer language,
b. the Indonesian island of Java, or
c. coffee.

The context identifier for a page indicates which context applies. A page and a link to a page may have more than one context identifier, and may be displayed in more than one context.

The preference is to have automatic association of context identifiers with page links created as users perform related searches; but those associations to links may also be created and changed by human index editors. This combination of automatic and manual association enables the system to define and re-define contexts and improve results.

I. Concept Analysis

The search engine provides answers to a query, ranked by algorithms for concept analysis, which can combine:
1. logical relevance, as measured by patterns of occurrence of the search terms in the query;
2. context identifiers;
3. page sponsorship;
4. others.

A. Ambiguous Terms and Contexts

One version analyzes search queries or results or both, and displays results grouped in contexts.
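To make the page index entry a little more concrete, here is a minimal sketch in Python of an entry holding a term, a link, and one or more context identifiers, plus a helper that groups links by context. The names IndexEntry, PAGE_INDEX and group_by_context, the example.com URLs, and the context labels are all my own assumptions for illustration; the patent application doesn't specify any of them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One hypothetical page-index entry: term, link, context identifiers."""
    term: str
    url: str
    contexts: tuple[str, ...]  # a page may carry more than one context


# Made-up entries for illustration only.
PAGE_INDEX = [
    IndexEntry("java", "https://example.com/java-tutorial", ("computer language",)),
    IndexEntry("java", "https://example.com/java-island-travel", ("island",)),
    IndexEntry("java", "https://example.com/coffee-and-code", ("coffee", "computer language")),
    IndexEntry("hawaii", "https://example.com/hawaii-weather", ("climate",)),
]


def group_by_context(entries, term):
    """Return {context identifier: [links]} for every entry matching the term."""
    grouped = defaultdict(list)
    wanted = term.lower()  # the sketch treats terms as case insensitive
    for entry in entries:
        if entry.term == wanted:
            for context in entry.contexts:
                grouped[context].append(entry.url)
    return dict(grouped)


if __name__ == "__main__":
    for context, links in group_by_context(PAGE_INDEX, "Java").items():
        print(context, links)
```

A page with two context identifiers shows up under both groups, which matches the note above that a link may be displayed in more than one context.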
So, a search for "Java" may group results in a number of contexts or word senses (categories) that have been identified. For "java," that might be:
1. Java the computer language,
2. Java the island, and
3. coffee java.

Results for those contexts might be presented as:
1. linked sets for each category, or
2. links to understandably different categories that users can select to see the associated links for each.

The links shown could include:
1. pages from the search index,
2. links associated with sponsored matches,
3. links associated with directory matches, and
4. links associated with Inside Yahoo! (IY) matches.

B. Unambiguous Terms and Contexts

While the search engine will work with words or phrases that have ambiguous meanings such as "Java", some grouping may happen for terms that aren't necessarily ambiguous. An example may be the results for a search using the term "Hawaii". "Hawaii" by itself might not be ambiguous, but results for it could be very broad, including every site mentioning Hawaii. The search engine might provide useful results by organizing them into categories (contexts). Results could be returned in various groupings such as:
1. "Hawaii: travel"
2. "Hawaii: climate"
3. "Hawaii: geography"
4. "Hawaii: culture"
5. others.

These context identifiers may be stored in the page index. Remember, an entry in the page index includes a search term, a link to a page on which that term appears, and one or more context identifiers for the page.

II. A Statement about the System

The patent application at this point asks us to keep in mind that the descriptions in the application are illustrative, and that a system developed from this application may be different from what it is describing. The system could be limited to a small area, or part of a widely distributed network. It could use more than one set of indexes and algorithms for providing results, and it could include information from other sources.

III. Concept Networks and Superunits

A. Recognizing Concepts

One version of this system uses algorithms to analyze concepts related to search terms to return relevant results. For example, searching for "New York City," a user is probably interested in sites about New York City instead of a city in New York. And, in a search for "New York City law enforcement," the user probably wants sites about law enforcement in New York City.

Most search engines would search looking for sites that contained the individual words in that query, regardless of their order. Other search engines might look for the longest string of adjacent words from the search phrase which also appears in their index. So, if that index contained "New York", "New York City" and "New York City law" but not "New York City law enforcement", that search engine would use "New York City law" and "enforcement" in its search, probably not returning the results that the searcher was looking for.

The system in this patent application would, when faced with "New York City law enforcement," recognize the concepts "New York City" and "law enforcement" and return results for these two concepts.

i. Using the Order of Terms to Recognize Concepts

The system could use the order of terms within a search query to identify the concepts that make it up. So, this system could hash together "New York City" and "law enforcement" as concepts in the query and return results for those concepts. And it's possible that the same results would be returned for "law enforcement in New York City." But, for "city law enforcement in New York," the concepts "law enforcement" and "New York" and "city," or "city law enforcement" and "New York" might be used. And, "enforcement of law in New York City" could include the concepts "New York City," "law" and "enforcement." The order of concepts isn't as important as the order of terms that make up a concept.

ii. Using Unit Dictionaries and Concept Discovery to Recognize Concepts

In one version of this system, concepts would be included in the page index, for instance, as terms and/or context identifiers. Or it could use a separate concept index. Or it could use both, so that "law enforcement" might be understood as the same as "enforcement of law" or not, depending on the context.

A concept within a query could be detected by using a "unit dictionary" containing a list of known concepts (or "units"). A unit dictionary could be created by using information from a large number of previous queries, preferably at least several hundred thousand. This type of creation is referred to as "concept discovery" and uses an analysis of those previous queries to build a concept network. It may be performed by the search service or by another server.

B. Concept Networks

A "concept network" refers to any set of relationships among concepts. Each concept or unit (e.g., "New", "York", "New York City", etc.) is a "node" of the network and is connected to other nodes by "edges" representing relationships between concepts. A concept network can define different types of relationships. Relationships include:
1. extensions ("ext"),
2. associations ("assoc"),
3. alternatives ("alt"), and
4. other relationships, which could be defined in addition to, or in place of, those.

i. Extensions as Relationships

An extension is a relationship between two concepts or units that exists when the string obtained by concatenating the two concepts or units is also a concept or unit. Example: the string obtained by concatenating units "new york" and "city" is "new york city," which is also a unit.

ii. Associations as Relationships

An association is a relationship that exists between two concepts or units that appear in queries together. Example: the word (unit) "hotels" can be associated with (the unit) "new york" and it can be associated with (the unit) "new york city".
Pairs of associated units are referred to as "neighbors," and the "neighborhood" of a unit is the set of its neighbors.

An association between units may require a certain amount of co-occurrence. In other words, compare the number of times the units don't appear together in a query with the number of times that they do, and if they "co-occur" in a large enough number of queries, they are associated.

Keep in mind that they do not have to be next to (adjacent to) each other to be related by association. Also, if you place those words or concepts next to each other in a string of associated units, they don't need to make up a new unit. But if they do, then an extension relationship would also exist. So, an extension relationship is really a special kind of association relationship. For instance, the words "dog" and "pound" are probably associated because they may just appear in a large number of queries together. And, placing them next to each other also shows an extension relationship - "dog pound."

iii. Alternatives as Relationships

An alternative relationship is when you have a word or concept and a different form of the same expression. These can be broken down into:
1. preferred,
2. corrected, or
3. a variation of that first unit.

Examples of alternatives might be:
1. "motel" and "hotel,"
2. "brittany spears" and "britney spears" (different spellings), or
3. "belgian" and "belgium" (different parts of speech).

A "preferred" alternative unit would be the one that shows up more frequently. A correctly spelled alternative might be the preferred alternative unit. Words that differ only on the basis of capitalization aren't normally alternatives under this system, since it's case insensitive; but in other versions of the search engine case may matter, and units that differ only in capitalization may be considered alternatives.

iv. Showing the Strength of Relationships by Assigning Weights to Edges

In our concept network, each concept or unit is a "node" and is connected to other nodes by "edges" which represent the relationships between those concepts. The edges in the concept network can be assigned weights, which could be numerical values representing the relative strength of the different relationships.

Example of the use of weights for relationships: the weight of an edge (relationship) between a first unit and an associated unit (keep in mind that this is just one of the different types of relationships) may be based on taking all of the searches which include the first unit, and looking at the fraction of those searches which also contain the associated unit. Or it could look at all of the searches which contain either unit, and take the percentage of those which contain both. These types of weights can show how strong the relationship is. The weights may be normalized (some examples of normalization are here: http://www.datamodel.org/NormalizationRules.html ).

The example above may be what is used to show the weight of a relationship between words (units or concepts) that have an association relationship. The other types of relationships may be given weights in different ways. A "concept network" includes all of those different ways of assigning weights to different relationships.

Next: Supernodes...
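The two association weights described above can be sketched in a few lines of Python. This is only my reading of the description, not the patent's implementation: the function names build_association_weights and is_extension, the min_cooccurrence threshold value, and the sample queries are all hypothetical, and the queries are assumed to be already parsed into units.

```python
from collections import Counter
from itertools import combinations


def build_association_weights(queries, min_cooccurrence=2):
    """Build association edges between units that co-occur in queries.

    queries: list of queries, each already split into units (an assumption).
    Returns {(unit_a, unit_b): weights}, with each pair in alphabetical order.
    """
    unit_counts = Counter()
    pair_counts = Counter()
    for units in queries:
        distinct = sorted(set(units))  # count each unit once per query
        unit_counts.update(distinct)
        pair_counts.update(combinations(distinct, 2))

    edges = {}
    for (a, b), both in pair_counts.items():
        if both < min_cooccurrence:  # require a minimum co-occurrence
            continue
        edges[(a, b)] = {
            # fraction of queries containing a that also contain b
            "given_a": both / unit_counts[a],
            # fraction of queries containing b that also contain a
            "given_b": both / unit_counts[b],
            # fraction of queries containing either unit that contain both
            "either": both / (unit_counts[a] + unit_counts[b] - both),
        }
    return edges


def is_extension(first, second, unit_dictionary):
    """True when concatenating the two units yields another known unit."""
    return f"{first} {second}" in unit_dictionary


# Made-up queries, for illustration only.
SAMPLE_QUERIES = [
    ["new york city", "hotels"],
    ["new york city", "law enforcement"],
    ["hotels", "new york city"],
    ["new york", "hotels"],
    ["hawaii", "hotels"],
]

if __name__ == "__main__":
    print(build_association_weights(SAMPLE_QUERIES))
    print(is_extension("new york", "city", {"new york", "city", "new york city"}))
```

Note that the order of units inside a query doesn't matter here, which fits the point that associated units don't have to be adjacent. The extension check is kept separate because it depends on the unit dictionary, not on co-occurrence counts.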
Apr 21 2005, 11:34 PM
The Yahoo! patent application uses the term co-occurrence a number of times. Orion, from SEW forums, started a great thread on the subject a while back, and it's one of the most visited threads on an SEO forum anywhere.
It's a topic worth looking at more closely, and if you haven't read the thread, it's worth spending some time there going through it. See: Keywords Co-occurrence and Semantic Connectivity

You may want to start with the examples in this post: http://forums.searchenginewatch.com/showpo...492&postcount=9
Apr 24 2005, 01:21 AM
I'm not sure that I can agree with you, Michael.

There may be value in looking at the co-occurrence of the terms regardless of whether or not this patent has been implemented. It can give a sense of which of those terms the present indexing system finds the most relevant. There's also value in gaining a sense of how the technologies described in the patent would work, and how to make a decision like the one that Travis is trying to make based upon that technology.

Sure, the technology described in the patent application is still in its infancy, but so is the rest of the web. And the ideas in the application didn't spring out of someone's head fully formed. A fair number of people have been giving them a fair amount of thought. So it makes sense to try to figure out what the patent application may mean, and how to use a co-occurrence analysis to see what Yahoo!'s indexing programs think of the relationships between those different phrases, regardless of the technology in place.

Looking closer at the patent, it's only one of at least three that Yahoo! has filed that cover these concepts. Here are the other two: Systems and methods for generating concept units from search queries

QUOTE(Travis)
Based on the semantic connectivity index presented, which one should we choose and why. Should we not just type these into the overture tool and choose the most popular one ?

The overture tool will only tell you which of those terms might have been searched for the most over a short period of time. It can be useful in determining whether or not people are searching for those terms.

The formula that Orion gave was:

c = n12 / (n1 + n2 - n12)

About the formula. The idea behind this is simply to get a sense of how many times these keywords appear together in the index (roughly, in the same results) compared to how many times either keyword appears in the index in total.

So, n12 is the number of times the keywords appear together in results. n1 is the number of times that the first keyword appears in results. It is important that your n1 be the same across a comparison like this. n2 is the number of times the second keyword appears in results. The reason why we subtract n12 from n1 + n2 in the formula is to avoid counting twice those results where they appear together.

Let's test those phrases in Yahoo! using the formula to see which has the highest c-index.

n1 (engineering) = 146,000,000
n2 (training) = 382,000,000
n12 (engineering training) = 36,200,000
c = 36,200,000 / (146,000,000 + 382,000,000 - 36,200,000)
c = 36,200,000 / 491,800,000
c = 0.0736 or 73.6 ppt

n1 (engineering) = 146,000,000
n2 (workshops) = 48,500,000
n12 (engineering workshops) = 4,900,000
c = 4,900,000 / (146,000,000 + 48,500,000 - 4,900,000)
c = 4,900,000 / 189,600,000
c = 0.0258 or 25.8 ppt

n1 (engineering) = 146,000,000
n2 (seminars) = 41,100,000
n12 (engineering seminars) = 5,000,000
c = 5,000,000 / (146,000,000 + 41,100,000 - 5,000,000)
c = 5,000,000 / 182,100,000
c = 0.0275 or 27.5 ppt

n1 (engineering) = 146,000,000
n2 (courses) = 119,000,000
n12 (engineering courses) = 17,200,000
c = 17,200,000 / (146,000,000 + 119,000,000 - 17,200,000)
c = 17,200,000 / 247,800,000
c = 0.0694 or 69.4 ppt

Our results (in parts per thousand, to make it easy to compare the numbers):

engineering training - 73.6 ppt
engineering courses - 69.4 ppt
engineering seminars - 27.5 ppt
engineering workshops - 25.8 ppt

So, the words engineering and training appear together more frequently in documents in Yahoo!'s index than engineering and courses, followed by engineering and seminars, and then engineering and workshops.

Does this mean that more people might search for "engineering training" than "engineering workshops?" Don't know. But we do know that there is a greater percentage of documents in Yahoo!'s index where engineering and training appear within the same document than engineering and workshops.

It is important that one of the keywords is the same here. We use "engineering" in all of these. If we were using completely different sets of keywords, the comparison wouldn't be worth anything. We know from this comparison that, of the second words in these phrases, training shows up with engineering at a higher frequency than any of the others.

Now, that's looking at what Yahoo! has for these phrases. Keep in mind that Google will have something different. So will MSN and Ask Jeeves.
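If you'd rather not do the arithmetic by hand, here is a small Python sketch of the same c-index calculation. The function name c_index and the way the counts are laid out are my own choices; the counts themselves are the Yahoo! result counts quoted in this post.

```python
def c_index(n1, n2, n12):
    """Co-occurrence index: c = n12 / (n1 + n2 - n12)."""
    either = n1 + n2 - n12  # results containing at least one keyword
    if either <= 0:
        raise ValueError("counts must leave a positive denominator")
    return n12 / either


ENGINEERING = 146_000_000  # n1, shared by every comparison

# second keyword -> (n2, n12), as quoted in the post
COUNTS = {
    "training": (382_000_000, 36_200_000),
    "workshops": (48_500_000, 4_900_000),
    "seminars": (41_100_000, 5_000_000),
    "courses": (119_000_000, 17_200_000),
}

# Rank the phrases by c-index, expressed in parts per thousand.
ranked = sorted(
    ((c_index(ENGINEERING, n2, n12) * 1000, word) for word, (n2, n12) in COUNTS.items()),
    reverse=True,
)

if __name__ == "__main__":
    for ppt, word in ranked:
        print(f"engineering {word}: {ppt:.1f} ppt")
```

With the counts quoted in the post, this prints roughly 73.6, 69.4, 27.5 and 25.8 ppt for training, courses, seminars and workshops. Swapping in counts from another search engine only means changing the numbers in COUNTS and ENGINEERING.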
Apr 24 2005, 09:51 PM
QUOTE(Travis)
But seriously, the algorithms of search engines are devised in such a way to make the mathematical calculation at the last stage very simple.

That's an excellent point, Travis. These things can get pretty complex, and involve a great number of calculations. I'm looking forward to hearing about your results.

QUOTE(bwelford)
How often does the two word phrase appear in web pages?

Important question, Barry. The patent does look at a number of different relationships. The co-occurrence concept we've looked at is one that might fit best under the "association" relationship described in the patent application. And, looking at the number of pages where the phrase itself appears can be meaningful, too. I believe that Orion describes this in his discussion of co-occurrence at SEW.

We also want to consider these phrases under an analysis that involves an "extension" relationship. As I wrote above:

QUOTE
Also if you place those words or concepts next to each other in a string of associated units, they don't need to make up a new unit. But if they do, then an extension relationship would also exist. So, an extension relationship is really a special kind of an association relationship. For instance, the words "dog" and "pound" are probably associated because they may just appear in a large number of queries together. And, placing them next to each other also shows an extension relationship - "dog pound."

How might you determine whether or not a phrase has some semantic connection? Possibly by looking at the ratio of results where the exact phrase appears (an exact search using quotation marks) to the results where both words appear (a findall result, which is what many search engines return when you enter a phrase without quotation marks, where the search engine looks for keyword1 and keyword2 and keyword3, etc.).

Some of the math, and some of the potential pitfalls of this approach, are more fully covered here: Overlapping Patterns: EF-Ratios, Separators, Patterns and Pitfalls
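As a rough sketch of that exact-to-findall ratio in Python: the function name ef_ratio is my own, and the counts in the example are made up purely to show the calculation, not taken from any search engine.

```python
def ef_ratio(exact_count, findall_count):
    """Ratio of exact-phrase results to findall (all words present) results."""
    if findall_count <= 0:
        raise ValueError("findall_count must be positive")
    return exact_count / findall_count


if __name__ == "__main__":
    # Hypothetical counts: 250 exact-phrase results out of 1,000 findall results.
    print(ef_ratio(250, 1000))
```

A ratio closer to 1 would suggest that when the words appear on the same page they usually appear as the phrase itself; the pitfalls mentioned in the linked discussion still apply.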
May 2 2005, 10:49 PM
Hi Chinese,
Welcome to the forum. Only the people at Yahoo! really know for sure what they are using, just as only the people at Google know for sure what they are using.

The Google patent application was made public on March 31st. And the Yahoo! patent application was made public on April 14th. So yes, only a couple of weeks separate the dates that these patents could have been noticed by the public. Using a patent application as a part of viral marketing would be a potentially expensive, and potentially embarrassing, way of advertising.

I do think that it is important that we ask what value each document has. I also think that it's important that we take a look at the methods described, and ask if it's worth trying to deconstruct them, and see if the concepts included have value in learning and understanding. I'd say that regardless of whether or not Google or Yahoo! are using those technologies, it's still worth learning about the methods described, and ignoring them is something that people interested in understanding the search engines should do at their own risk.

A patent doesn't have to be an indication of what a search engine is actually using, though it is great to get something to look at that is written by the people who work for the search engines. It's at least as worthy of study as all of the speculation that circulates around the web. The ideas and concepts behind each patent application are well thought out, and show possible methods for indexing the web using a search engine. Not only that, but if those patent applications become actual patents, they enable each company to exclude others from using the technology described.
The ideas presented in the applications cover a wide gamut of notions about how a search engine could work. Google's looks at how historical data could be used to determine credibility, authenticity, and relevance. Yahoo!'s contemplates how the words entered in searchers' queries may exhibit some type of relationship between them, and how exploring those queries and relationships can be used to augment other methods of indexing, including a human edited directory, and might make search results more relevant.

QUOTE
But I am still waiting for results proving that speculation on latent semantic analysis is used by google for instance.

You know, of course, that much of the speculation tied to whether Google is using some sort of latent semantic analysis was fueled by their acquisition of the company Applied Semantics a couple of years back, and chances are good that some of that technology has been used to develop their contextual advertising. Has it been included in the indexing and presentation of search results? We can't be sure. But you don't stand a chance of knowing or not knowing unless you actually research how latent semantic indexing works.

I'm not sure whom it is that you would expect to issue proof that Google is using that type of indexing in their search results. Google wouldn't be the ones to come out and say that they are or aren't. There's no real benefit for the search engines to come out directly and explain exactly what they are doing. But there is a benefit to them in applying for a patent to protect their intellectual property. Failure to apply for a patent in a timely manner can keep an inventor from ever getting a patent on the material. Such a failure would also keep them from excluding others from using that material. There is a benefit to applying for a patent other than just some press, or a handful of discussions in a very few forums.
There's also a benefit to learning about and understanding co-occurrence. I'm not sure if you spend much time with the search engines, performing searches, and experimenting with them, just to see what types of results you get back from the attempts. Or if you care much about how words are distributed around the web, and how the different search engines will index those words differently.

Co-occurrence can be used as a tool to help understand how a search engine may be working. You really don't need to believe it is something that a search engine is or isn't using at this point to get a benefit from understanding it, and seeing how it can be applied to understand how a search engine indexes words. Regardless of what a search engine does, it does make sense that when you have a choice of phrases which share at least one word, and you can see that the words which make up one of the phrases tend to appear together with a much higher frequency than the words in the other phrases, there is probably some type of meaningful connection between those words.

Of course, you don't have to bother. You don't need to read these patents, and try to understand them. You don't need to try to make sense of latent semantic indexing or co-occurrence. When the search engines publish patent applications, you could just believe that it's a publicity stunt, and ignore it. That's certainly your prerogative.

There is a lot of misinformation, and disinformation, on the web. Believe what you want at your own peril. For instance, there are lots of documents on the web that explain how to set up meta tags on your pages for best effect. I've seen hundreds that recommend using a "revisit after" tag, even though the only search engines to really use it include a third tier one that you may not have heard of and the inventor of the tag, a regional one in British Columbia, and they've given up on using it.
When a search engine sends out information about potential ways to index the material on the web, it really doesn't matter if it is something that they are presently using, something that they may use, or if it is something that they will never use. Understanding that information, and possessing the ability to see how it could be used, and may be used, has a fair amount of value. It allows you to make informed and educated guesses in the absence of insider knowledge. Sure, view it with a rational amount of skepticism. But ignoring it as an advertising stunt without taking the effort to understand it is something that I'm not ready to do.