
Leading Community for Usability, Search Engine Marketing,
Social Networking, Site Planning & Web Site Development, Since 1998


Latest Search Engine Technology LSI (Latent Semantic Indexing)


47 replies to this topic

#1 web based expertise

web based expertise

    Unlurked Energy

  • Members
  • 3 posts

Posted 29 May 2006 - 02:41 AM

Hi all, :)

My objective here is to discuss the latest search engine technology, LSI (Latent Semantic Indexing).

It is one of the latest and most crucial technologies for search engines; it helps a search engine retrieve the data of a web site and present the results using LSI-based technology.

We have carried out, and are still updating, extensive research on this latest search engine technology. If any expert from the search engine industry wants to share or discuss his or her expertise and experience with LSI, they are most welcome.

Thanks

Naveen Gupta

On-line Marketing Consultant

#2 A.N.Onym

A.N.Onym

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 4001 posts
  • Twitter:http://twitter.com/yuraf
  • Facebook:http://www.facebook.com/yura.filimonov

Posted 29 May 2006 - 05:14 AM

Not sure about the experts, but though I may have dealt with 'latent semantic indexing', I am not sure what it is.
Could you please provide an example? This should get you at least my opinion :)

#3 send2paul

send2paul

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 2870 posts
  • Facebook:https://www.facebook.com/ThatBoyThere

Posted 29 May 2006 - 05:23 AM

For the benefit of all concerned, (apart from Naveen :)),

Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, invented in 1990 [1] by Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, and Richard Harshman. In the context of its application to information retrieval, it is sometimes called latent semantic indexing (LSI).

from Wikipedia

It's a complex-looking subject, Naveen. Where would you like to start the discussion?

#4 A.N.Onym

A.N.Onym

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 4001 posts
  • Twitter:http://twitter.com/yuraf
  • Facebook:http://www.facebook.com/yura.filimonov

Posted 29 May 2006 - 05:36 AM

Don't think machines will ever be able to track natural language.
Not only does language have an unlimited number of parameters, it also evolves.
Sure, get close to the language in 5 years, and in 10-20 years you'll be dealing with a new direction the language has taken.

Also, some words will lose some of their meanings, some will acquire meanings and some will swap meanings. Will a machine be able to track all this?

Good luck to the scientists and engineers, though.

Edited by A.N.Onym, 29 May 2006 - 05:37 AM.


#5 joedolson

joedolson

    Eyes Like Hawk Moderator

  • Technical Administrators
  • 2869 posts
  • Twitter:http://twitter.com/joedolson
  • Facebook:http://facebook.com/joedolson

Posted 29 May 2006 - 05:50 AM

There have been a number of great posts on this at SEOmoz - Michael Martinez recently wrote a very detailed post on latent semantic indexing, specifically talking about how it's not really been implemented yet in any significant way due to the computational power it requires - even Google can't altogether cut it.

Rand talked about the idea in February of 2005 and gave an interesting early perspective, as well.

Aaron Wall wrote about it at Search Engine Journal, and rustybrick made some extensive comments at SE Roundtable.

It seems to me that the complexity of natural language and the construction of meaning may be a significant barrier to the kind of large-scale analysis a search engine needs to do. A reduced test case may be practical, but at least for the time being I think it's beyond the capability of search engines.

But I don't think that LSA needs to fully "track" natural language, on the other hand - it needs to be able to learn and change; not maintain a fixed idea of how language works. Even with an incomplete implementation, it may well work in an extremely sophisticated and effective manner.

As send2paul said, it's a complex subject - possibly beyond any meaningful application in SEO, and certainly beyond my mathematics :)

#6 JohnMu

JohnMu

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3518 posts

Posted 29 May 2006 - 05:53 AM

I would love for something like this to go live -- Imagine the possibilities (of tricking the machine :)).

If it is really "just" a processor power question, then it would be just a question of time for it to go live. However, as with other similar items (speech recognition comes to mind), processor power is not a cure for everything :)

John

#7 joedolson

joedolson

    Eyes Like Hawk Moderator

  • Technical Administrators
  • 2869 posts
  • Twitter:http://twitter.com/joedolson
  • Facebook:http://facebook.com/joedolson

Posted 29 May 2006 - 05:59 AM

If it is really "just" a processor power question, then it would be just a question of time for it to go live.


Very true...

And, having just looked again, Michael Martinez does not actually say anything about computing power (I know I read that somewhere...but I guess it wasn't that article). What he actually says is:

Unfortunately, the technology does not yet exist to enable the search engines to do that kind of associative indexing. In fact, it would be more appropriate to refer to the process as "associative indexing" because that is really what we are talking about (in this context). The closest we have come to associative indexing in today's search engine technology is stemming, where words are indexed on the basis of their uninflected roots (plural forms, adverbial forms, and adjectival forms are reduced to their simplified noun and verb forms before indexing).
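To make the stemming idea in the quote concrete, here is a minimal sketch of indexing words under a reduced root. The suffix list and the `naive_stem` and `build_index` names are assumptions made up for illustration; real stemmers use far richer rules, and nothing here claims to be what any search engine actually runs.

```python
from collections import defaultdict

# Hypothetical suffix list for illustration only; real stemmers are much more careful.
SUFFIXES = ("ing", "ed", "es", "ly", "s")


def naive_stem(word):
    """Strip one common English suffix, keeping at least three letters of the root."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain some form of it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            index[naive_stem(word)].add(doc_id)
    return index


docs = ["indexing the web", "pages indexed quickly", "search indexes"]
index = build_index(docs)
print(sorted(index["index"]))  # ids of documents containing any form of "index"
```

All three documents end up under the single key "index", which is the kind of association by root that the quote describes. A stemmer this crude also mangles plenty of words, so treat it as a toy.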



#8 A.N.Onym

A.N.Onym

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 4001 posts
  • Twitter:http://twitter.com/yuraf
  • Facebook:http://www.facebook.com/yura.filimonov

Posted 29 May 2006 - 06:06 AM

Unfortunately, the computer will need to gather a lot of data before it can jump to a conclusion that the language has evolved.

Take 'car' and 'bottle' for instance.
Currently, there is little association between the two.
What if gas comes to be measured in bottles? How long will the computer need to reach that conclusion? I'd say until it is very well established, but in the meantime it'll be providing inadequate information :)

Yeah, there shouldn't necessarily be anything 'fixed' in the algorithm. It doesn't mean it becomes much simpler, though.

#9 fisicx

fisicx

    Sonic Boom Member

  • 1000 Post Club
  • 1821 posts

Posted 29 May 2006 - 06:15 AM

It's not the words it's the meaning.

Search for angles. How does the SE know what sort of angle I'm looking for: the ancient peoples of Europe, a mathematical construct, or a new approach to a problem? Or did they mean angels?

We know because we can read the context; the SE has to second-guess our intention. For an SE to incorporate LSI it needs to understand the query, which would mean us asking a question: 'what is the name given to an angle that is less than 90 degrees'. But we don't want to do this. We bash away using a range of keywords in the hope that one of the results will be useful to us.

If of course the SE could track our investigation it could begin to work out exactly what we are looking for. But this would mean storing everybody's search criteria so that it could over time begin to realize that somebody is a doctor, a scientist, a musician. Only then can LSI begin to work. IMHO.

Incidentally, as a teacher I get students whose first language is not English. Simple statements can be misinterpreted because the literal meaning differs from the idiomatic meaning.

#10 joedolson

joedolson

    Eyes Like Hawk Moderator

  • Technical Administrators
  • 2869 posts
  • Twitter:http://twitter.com/joedolson
  • Facebook:http://facebook.com/joedolson

Posted 29 May 2006 - 06:23 AM

I'm not sure that's significantly different from how language is communicated in society - perhaps the association of car and bottles begins gradually, appearing occasionally in media sources or in advertising. Over the course of 10 years, it becomes a standard association. From a search perspective, this may be a semantic association which is very unimportant at first - and that's exactly what it should be, because the association of car and bottle is a very weak semantic link if it's purely an association due to a few dozen media sources, etc.

As the language continues to develop and the terms are used more and more in conjunction, the semantic engine would learn that they are more importantly matched, and may begin to make that association.
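The gradual strengthening of an association can be sketched with a simple co-occurrence measure. This is only an illustration: the Jaccard ratio, the `association` function and the two tiny corpora are assumptions chosen for the example, not a description of how any engine weighs term relationships.

```python
def association(docs, a, b):
    """Jaccard strength: documents containing both terms / documents containing either."""
    both = either = 0
    for text in docs:
        words = set(text.lower().split())
        if a in words or b in words:
            either += 1
            if a in words and b in words:
                both += 1
    return both / either if either else 0.0


# Made-up corpora: a few early mentions, then the pairing becomes common.
early = ["car engine repair", "wine bottle shop", "car on the road", "car fuel bottle"]
later = early + ["car fuel bottle price"] * 4

print(association(early, "car", "bottle"))  # 0.25
print(association(later, "car", "bottle"))  # 0.625
```

With only one shared document the link stays weak, which is what it should be; it only grows as the terms keep turning up together.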

I don't think that this is the chief problem for semantic indexing; I think a bigger problem may be coping with multiple meanings. The terms themselves are insignificant - how does the engine know that I MEAN to be looking for ancient Mayan symbology when I search for "jaguar", instead of a car or a cat?
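One way to picture the multiple-meanings problem is a context-overlap check: pick the sense whose typical companion words share the most with the rest of the query. The `SENSES` table and the `disambiguate` function below are hypothetical, invented for this sketch; a bare "jaguar" query still gives the engine nothing to go on.

```python
# Hypothetical sense signatures, made up for illustration.
SENSES = {
    "jaguar": {
        "car": {"dealership", "engine", "price", "sedan"},
        "animal": {"cat", "jungle", "prey", "habitat"},
        "mythology": {"mayan", "temple", "symbol", "carving"},
    }
}


def disambiguate(query, term):
    """Return the sense sharing the most words with the query, or None if nothing overlaps."""
    context = set(query.lower().split()) - {term}
    best_sense, best_overlap = None, 0
    for sense, signature in SENSES.get(term, {}).items():
        overlap = len(context & signature)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("jaguar mayan temple carving", "jaguar"))
print(disambiguate("jaguar dealership price", "jaguar"))
print(disambiguate("jaguar", "jaguar"))
```

The last call returns None, which is the whole difficulty: a one-word query carries no context to read.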

And, like fisicx says, literal meaning is very difficult to manage within the context of idiomatic speech - when I say in my blog that I'm off to "hit the hay" a human will easily identify that idiom - but an algorithm may have difficulty associating "hit hay" with going to bed.

If of course the SE could track our investigation it could begin to work out exactly what we are looking for. But this would mean storing everybody's search criteria so that it could over time begin to realize that somebody is a doctor, a scientist, a musician. Only then can LSI begin to work. IMHO.


Ultimately, do I want my searches to be incredibly effective because the search engine knows everything about my life and interests? Not really...I'd rather have privacy and have to work a bit harder to find the information I need! Even with all this collection of information, it would be very difficult for the search engine to manage ALL the varied interests and possible curious questions somebody would have over the course of their life.

#11 Black_Knight

Black_Knight

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9293 posts
  • Twitter:http://twitter.com/#!/Ammon_Johns
  • Facebook:http://www.facebook.com/ammon.johns

Posted 29 May 2006 - 09:06 AM

We have carried out, and are still updating, extensive research on this latest search engine technology. If any expert from the search engine industry wants to share or discuss his or her expertise and experience with LSI, they are most welcome.

I hate to rain on a parade for 'the latest search engine technology', but we first covered this subject a few years ago. In fact, within a few months of first opening these forums in 2002.
http://www.cre8asite...p?showtopic=593

In 2003, we were already directly applying advice regarding LSA to sites and techniques being discussed.
http://www.cre8asite...latent semantic

If LSI/LSA were really the latest technology then those hundreds of Information Retrieval scientists haven't been doing much in the years since. :) Seriously though, many of the most fundamental published papers on LSA were published in the mid nineties, and I've seen a lot of papers with publication dates from '94 to '96.

LSA is far older than many more significant updates to technology, including the infamous 'Florida' update which itself now seems quite ancient history.

Far more recently (a mere 2+ years ago, in January 2004) there was mention of a specific use of Semantic Analysis by a specific engine in the following discussion.
http://www.cre8asite...?showtopic=5394

A good place to start with studying LSA would be the engines themselves.
http://www.google.co...indexing&num=50

#12 Spencer Hoyt

Spencer Hoyt

    Unlurked Energy

  • Members
  • 6 posts

Posted 30 May 2006 - 02:02 PM

I can't believe that you guys are not using LSI!!!
It is one of my secret weapons for competitive terms. If you would like to learn more about LSI, Google Michael Marshall. He is the SEO who basically invented LSI and SEO.
Good Luck. :applause:

#13 joedolson

joedolson

    Eyes Like Hawk Moderator

  • Technical Administrators
  • 2869 posts
  • Twitter:http://twitter.com/joedolson
  • Facebook:http://facebook.com/joedolson

Posted 30 May 2006 - 02:11 PM

I can't believe that you guys are not using LSI!!!
It is one of my secret weapons for competitive terms. If you would like to learn more about LSI, Google Michael Marshall. He is the SEO who basically invented LSI and SEO.


I'm not sure exactly what you mean by "using" latent semantic indexing. Latent semantic indexing is a technique for constructing an algorithm that can identify the essential meaning of your query or website and use that knowledge to make connections despite the lack of a direct key-term relationship. (To attempt to describe it, however imprecisely.)

If you're talking about the same Michael Marshall I know of, then he has written some interesting articles about writing content with LSI in mind.

However, as interesting as this article is, Michael Marshall certainly did not actually "invent" LSI or SEO.

I'd be interested in hearing how LSI has explicitly aided your SEO campaigns - how have you applied the concepts of LSI to your content and what demonstrates that this has helped you?

#14 bragadocchio

bragadocchio

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15634 posts

Posted 30 May 2006 - 03:28 PM

It is one of the latest and most crucial technologies for search engines; it helps a search engine retrieve the data of a web site and present the results using LSI-based technology.


The concept of Indexing by Latent Semantic Analysis (pdf) was probably introduced in 1990, though it relies on a lot of research from the 1960s, 70s, and 80s (see the list of citations in the paper I linked to.)

The biggest public plunge that Google has probably made concerning any type of semantic indexing was their purchase of the company Applied Semantics, with its Conceptual Information Retrieval and Communication Architecture (CIRCA) technology. A number of similar ideas surfaced in two recent patent applications from Google:

Phrase-based searching in an information retrieval system

Multiple index based information retrieval system

Think about Google's supplemental index while looking at this one above. :)

See also the Applied Semantics Patents:

Meaning-based information organization and retrieval

Meaning-based advertising and document relevance determination

#15 Black_Knight

Black_Knight

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9293 posts
  • Twitter:http://twitter.com/#!/Ammon_Johns
  • Facebook:http://www.facebook.com/ammon.johns

Posted 30 May 2006 - 05:46 PM

It is always difficult to separate Semantics from Latent Semantics. However, there is of course a big difference in implication (and in spam-proofing).

The work of Applied Semantics was largely concerned with non-latent semantics. It was about contextual semantic clues usually within the same body of text. Latent Semantics is not about the words appearing in the document, but rather about those latent clues not spoken/written consciously.

When Bill links to a paper, even one with a non-obvious title, we can gather a context to this from the very fact that Bill linked to it, and in what context, before we ever even think of opening that document itself.

Likewise, things can be presented to look like something that they really are not. An example of this might well be some press releases. Just because the document presents itself as news, and uses the language of a news bulletin, does not in itself actually make it news. Where the press release appears, and how much attention it gets (could be measured in links or in viewing activity) does far more to identify real news than the language in the document ever will.

A non-event press release created just for a link that nobody reads may appear through active semantics to be the real news it is not.

Conversely an item on the CNN website might not appear to look like news at all, but really is just from where it is located.

Latent Semantics is about unstated context, and unspoken/unwritten clues and cues. There is some crossover between the two things, quite naturally. But latent semantics is far harder to fake, and thus is of a far more sturdy and robust use for search engines.

#16 A.N.Onym

A.N.Onym

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 4001 posts
  • Twitter:http://twitter.com/yuraf
  • Facebook:http://www.facebook.com/yura.filimonov

Posted 30 May 2006 - 06:28 PM

Thank you for the explanation, BK.
Don't think the Wikipedia entry mentioned anything of that.

Well, it'll be hard for the search engines to figure out what to take into account.
I expect a lot of errors here, given that they are trying to describe the indescribable (or at least the less obvious).

If I were a search engine, I'd make sure I know my simple semantics well enough before jumping into something serious.

Btw, Google Co-op may well help fight spam by supplying samples of trusted sites, too, so perhaps they aren't going to spend most of their resources on LS. However, this field is really tempting for the search engines.

#17 web based expertise

web based expertise

    Unlurked Energy

  • Members
  • 3 posts

Posted 12 June 2006 - 08:28 AM

Hi, :)

Some quick facts about LSI:

1. LSI is 30% more effective than the popular word-matching method.
2. LSI uses a powerful and fully automatic statistical method (Singular Value Decomposition).
3. It is very effective in cross-language retrieval.
5. LSI can retrieve relevant information that does not contain the query words.
6. It finds more relevant information than other methods.
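For anyone wondering what the Singular Value Decomposition step in point 2 looks like, here is a toy sketch on a four-document corpus. Everything in it is an assumption made for illustration: the corpus, the function names, and the use of plain power iteration with deflation as a stand-in for a proper SVD routine. It says nothing about the effectiveness figures above or about what any search engine runs.

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def top_eigenpairs(sym, k, iters=300):
    """Top-k eigenpairs of a symmetric positive semi-definite matrix,
    via power iteration with deflation (a stand-in for a real SVD routine)."""
    n = len(sym)
    m = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        v = [float(i + 1) for i in range(n)]
        for _ in range(iters):
            w = matvec(m, v)
            length = math.sqrt(sum(x * x for x in w))
            v = [x / length for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(m, v)))
        pairs.append((lam, v))
        # Deflate: remove the component just found before looking for the next one.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0


# Made-up corpus: "car" and "automobile" never share a document, but both occur with "engine".
TERMS = ["car", "automobile", "engine", "flower", "petal"]
DOCS = ["car engine", "automobile engine", "flower petal", "petal flower"]
A = [[doc.split().count(term) for doc in DOCS] for term in TERMS]  # terms x documents


def lsi_scores(query, k=2):
    """Cosine similarity between the query and each document in a k-dimensional latent space."""
    q = [query.split().count(term) for term in TERMS]
    q_latent = []
    d_latent = [[] for _ in DOCS]
    for lam, v in top_eigenpairs(matmul(transpose(A), A), k):
        sigma = math.sqrt(lam)                      # singular value
        u = [x / sigma for x in matvec(A, v)]       # matching left singular vector
        q_latent.append(sum(a * b for a, b in zip(q, u)))
        for j in range(len(DOCS)):
            d_latent[j].append(sigma * v[j])
    return [cosine(q_latent, d) for d in d_latent]


scores = lsi_scores("car")
for doc, score in zip(DOCS, scores):
    print(f"{score:.2f}  {doc}")
```

The query "car" scores highly against "automobile engine" even though that document never contains the word "car", while the flower documents score about zero. That is the behaviour point 5 refers to, shown here on a corpus small enough to check by hand.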

Here are some examples of LSI-based and non-LSI-based SEO. Please look at them and analyze the difference.

Results on Google for the query “Laptop” without LSI
(Kindly analyze the top 10 results on Google)

http://www.google.co...en&lr=&q=laptop

Results on Google for the query “~Laptop” with LSI
http://www.google.co.....lr=&q=~laptop

Result on Google for Query “Mobile” without LSI
http://www.google.co...le &btnG=Search

Result on Google for Query “~Mobile” with LSI

http://www.google.co.....r=&q=~mobile


We have written a 37 e-book on LSI. It includes some facts about LSI.

Kindly let me know if you would enjoy reading it. I will mail it by PM.

Google is implementing it in its semantic results. :cheers:

We are making our research more extensive with the latest facts and figures.

Thanks

Naveen

Edited by web based expertise, 12 June 2006 - 08:42 AM.


#18 joedolson

joedolson

    Eyes Like Hawk Moderator

  • Technical Administrators
  • 2869 posts
  • Twitter:http://twitter.com/joedolson
  • Facebook:http://facebook.com/joedolson

Posted 12 June 2006 - 08:43 AM

Unless I've completely misunderstood, the tilde operator in Google has practically nothing to do with latent semantics - the tilde operator simply cross-indexes your search with searches on synonyms of the inputted search terms and applies stemming to identify alternate forms of the term searched. At best, this could be called inherent or explicit semantics - structured meaning which is related explicitly within the strict definition of the term searched.

Latent semantics would involve a more sophisticated analysis of the context of your search - particularly applicable in multiple term searches where the terms could be determined to have particular relationships.

Furthermore, latent semantics would require an examination of the overall context of the pages returned, which to the best of my knowledge, Google's operator does not perform.

#19 Black_Knight

Black_Knight

    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9293 posts
  • Twitter:http://twitter.com/#!/Ammon_Johns
  • Facebook:http://www.facebook.com/ammon.johns

Posted 12 June 2006 - 09:22 AM

Using "~keyword" in a search has nothing whatever to do with Latent Semantic Analysis. That's standard semantics, such as applied semantics, with nothing latent about it.

The ~ operator is simply used to find synonyms and stem-variants of a given word.
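As a rough picture of that kind of expansion, a query term can be widened to its synonyms and simple variants before matching. The `SYNONYMS` table, the naive plural rule and the function names are all assumptions for this sketch; it is not Google's implementation of the operator, only the general idea of explicit (non-latent) expansion.

```python
# Hypothetical synonym table, made up for illustration.
SYNONYMS = {"laptop": {"notebook"}, "mobile": {"cell", "cellular"}}


def variants(word):
    """The word plus a naive plural, standing in for stem variants."""
    return {word, word + "s"}


def expand(term):
    """The term, its listed synonyms, and simple variants of each."""
    term = term.lower()
    expanded = set()
    for word in {term} | SYNONYMS.get(term, set()):
        expanded |= variants(word)
    return expanded


def search(query, docs):
    """Ids of documents matching the literal term, or any expansion when prefixed with ~."""
    wanted = expand(query[1:]) if query.startswith("~") else {query.lower()}
    return [i for i, text in enumerate(docs) if wanted & set(text.lower().split())]


docs = ["cheap laptop deals", "notebooks on sale", "desktop towers"]
print(search("laptop", docs))   # [0]
print(search("~laptop", docs))  # [0, 1]
```

Note that everything the expansion knows comes from an explicit lookup table; nothing is inferred from how the words are actually used in the documents.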

Please tell me that your 37 ebooks on LSA do indeed know what LSA is, and what it is not.

This thread from the Search Engine Watch Forums discusses c-index methods for building up 'thesaurus' of related words by co-occurrence. As Orion states there, c-indexes date back to the 70s at least, and so far predate any existing search engine.

However, these too are mainly about 'active' semantics, not 'latent' semantics. Latent semantics are the things you only tend to spot from a distance, with fractals or some other plotting/graphing analysis.

The best way of explaining the difference might be to look at ways of blocking spam (such as emails).

If we use exact negative matching (allow no email that uses the word 'viagra' for example) this can get a wide range of false positives, blocking emails that were not actually spam at all.

Using semantics, we might look for variants of the word viagra, especially misspellings that are attempting to bypass the first kind of exact matching we just mentioned - like those that use "VlAGRA" (the I there is actually a lower-case L)
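The difference between those two kinds of matching can be sketched as follows. The `CONFUSABLES` map and the function names are hypothetical, chosen only to illustrate the lower-case L trick; a real filter would need a much larger table and smarter tokenizing.

```python
# Hypothetical map of look-alike characters to one canonical letter.
CONFUSABLES = str.maketrans({"l": "i", "1": "i", "!": "i", "|": "i", "0": "o", "@": "a"})


def canonical(token):
    """Lower-case the token and fold look-alike characters together."""
    return token.lower().translate(CONFUSABLES)


BLOCKED_EXACT = {"viagra"}
BLOCKED_CANONICAL = {canonical(word) for word in BLOCKED_EXACT}


def exact_match(message):
    """First approach: block only on the literal word."""
    return any(word in BLOCKED_EXACT for word in message.lower().split())


def variant_match(message):
    """Second approach: block on any spelling that folds to the same canonical form."""
    return any(canonical(word) in BLOCKED_CANONICAL for word in message.split())


spam = "Buy VlAGRA today"  # the second letter is a lower-case L
print(exact_match(spam))    # False
print(variant_match(spam))  # True
```

Because ordinary words are folded the same way ("love" becomes "iove"), they still do not collide with the blocked term, so an innocent message passes both checks.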

Latent semantics is the system that noted you could block over 70% of all spam by just blocking anything that used bold red exclamation points as punctuation. :)

Bayesian spam filtering methods are far more along the line of latent semantics than most other systems.
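For readers who have not met Bayesian filtering, a bare-bones naive Bayes classifier looks something like this. The training messages, labels and function names are made up for the sketch, and it uses add-one smoothing; production filters are considerably more involved.

```python
import math
from collections import Counter


def train(examples):
    """examples: list of (label, text). Returns per-label word counts, document counts, vocabulary."""
    word_counts = {}
    doc_counts = Counter()
    vocab = set()
    for label, text in examples:
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Pick the label with the highest log-probability under naive Bayes with add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        counts = word_counts[label]
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data for illustration.
model = train([
    ("spam", "buy cheap pills now"),
    ("spam", "cheap pills cheap prices"),
    ("ham", "meeting notes attached"),
    ("ham", "lunch meeting tomorrow"),
])
print(classify("cheap pills", model))
print(classify("lunch notes", model))
```

Nobody tells the filter which words matter; the weights fall out of how the words are distributed across the two piles of mail, which is the sense in which it leans on patterns rather than on stated rules.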

#20 yannis

yannis

    Sonic Boom Member

  • 1000 Post Club
  • 1634 posts

Posted 12 June 2006 - 09:34 AM

If you click here web based expertise you will get to learn the importance of anchor text! What you are pointing to is actually irrelevant and misleading. Bragging about your own SEO skills and your e-book about LSI technology leaves me very cold and is very unprofessional, at least in this forum. You have actually contributed nothing to this thread other than telling us how good you are!

The proof is in the pudding! Here is a challenge. Post a before and after 'LSI Technology' page and let the pros here comment on the techniques employed. Please spell check the page before you post! Prove your techniques with the relevant mathematics and I will eat my words and my summer hat!


