
Cre8asiteforums Internet Marketing
and Conversion Web Design


Latest Search Engine Technology: LSI (Latent Semantic Indexing)

47 replies to this topic

#1 web based expertise

web based expertise

    Unlurked Energy

  • Members
  • 3 posts

Posted 29 May 2006 - 02:41 AM

Hi all, :)

My objective here is to discuss the latest search engine technology, LSI (Latent Semantic Indexing).

It is one of the latest and most crucial technologies for search engines. It helps a search engine retrieve data from web sites and present results using LSI-based technology.

We have carried out, and keep updating, our extensive research on this latest search engine technology. If any expert from the search engine industry wants to share or discuss his/her expertise and experience with LSI, they are most welcome.


Naveen Gupta

On-line Marketing Consultant

#2 A.N.Onym


    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 4003 posts

Posted 29 May 2006 - 05:14 AM

Not sure about the experts, but though I may have dealt with 'latent semantic indexing', I am not sure what it is.
Could you please provide an example? This should get you at least my opinion :)

#3 send2paul


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 2935 posts

Posted 29 May 2006 - 05:23 AM

For the benefit of all concerned, (apart from Naveen :)),

Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, invented in 1990 [1] by Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, and Richard Harshman. In the context of its application to information retrieval, it is sometimes called latent semantic indexing (LSI).

(from Wikipedia)

It's a complex-looking subject, Naveen. Where would you like to start the discussion?

#4 A.N.Onym


    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 4003 posts

Posted 29 May 2006 - 05:36 AM

Don't think machines will ever be able to track the natural language.
Not only does language have an unlimited number of parameters, it also evolves.
Sure, get close to the language in 5 years, and in 10-20 years you'll be dealing with a new direction the language has taken.

Also, some words will lose some of their meanings, some will acquire meanings and some will swap meanings. Will a machine be able to track all this?

Good luck to the scientists and engineers, though.

Edited by A.N.Onym, 29 May 2006 - 05:37 AM.

#5 Guest_joedolson_*

  • Guests

Posted 29 May 2006 - 05:50 AM

There have been a number of great posts on this at SEOmoz - Michael Martinez recently wrote a very detailed post on latent semantic indexing, specifically talking about how it's not really been implemented yet in any significant way due to the computational power it requires - even Google can't altogether cut it.

Rand talked about the idea in February of 2005 and gave an interesting early perspective, as well.

Aaron Wall wrote about it at Search Engine Journal, and rustybrick made some extensive comments at SE Roundtable.

It seems to me that the complexity of natural language and the construction of meaning may be a significant barrier to the kind of large-scale analysis a search engine needs to do. A reduced test case may be practical, but at least for the time being I think it's beyond the capability of search engines.

But I don't think that LSA needs to fully "track" natural language, on the other hand - it needs to be able to learn and change; not maintain a fixed idea of how language works. Even with an incomplete implementation, it may well work in an extremely sophisticated and effective manner.

As send2paul said, it's a complex subject - possibly beyond any meaningful application in SEO, and certainly beyond my mathematics :)

#6 JohnMu


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 3519 posts

Posted 29 May 2006 - 05:53 AM

I would love for something like this to go live -- Imagine the possibilities (of tricking the machine :)).

If it is really "just" a processor power question, then it would be just a question of time for it to go live. However, as with other similar items (speech recognition comes to mind), processor power is not a cure for everything :)


#7 Guest_joedolson_*

  • Guests

Posted 29 May 2006 - 05:59 AM

If it is really "just" a processor power question, then it would be just a question of time for it to go live.

Very true...

And, having just looked again, Michael Martinez does not actually say anything about computing power (I know I read that somewhere...but I guess it wasn't that article). What he actually says is:

Unfortunately, the technology does not yet exist to enable the search engines to do that kind of associative indexing. In fact, it would be more appropriate to refer to the process as "associative indexing" because that is really what we are talking about (in this context). The closest we have come to associative indexing in today's search engine technology is stemming, where words are indexed on the basis of their uninflected roots (plural forms, adverbial forms, and adjectival forms are reduced to their simplified noun and verb forms before indexing).
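
As a rough illustration of the stemming idea in that quote, here is a toy sketch in Python. The suffix list and the `naive_stem` and `build_index` helpers are made up for this post; real stemmers (the Porter stemmer, for example) are far more careful, and nothing here claims to show how any actual search engine does it.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny suffix list; not a real stemming algorithm.
SUFFIXES = ("ing", "ly", "es", "s", "ed")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[naive_stem(word)].add(doc_id)
    return index

docs = {1: "search engines index pages", 2: "indexing a page quickly"}
index = build_index(docs)
print(sorted(index["index"]))  # both documents share the stem "index"
```

Note that this toy version reduces "pages" to "pag" but leaves "page" alone, so the two never meet in the index. That is the kind of gap a proper stemmer is built to close.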

#8 A.N.Onym


    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 4003 posts

Posted 29 May 2006 - 06:06 AM

Unfortunately, the computer will need to gather a lot of data before it can jump to a conclusion that the language has evolved.

Take 'car' and 'bottle' for instance.
Currently, there is little association between the two.
What if gas will be measured in bottles? How much time will it require the computer to reach the conclusion? I'd say till it is very well established, but in the meantime, it'll be providing inadequate information :)

Yeah, there shouldn't necessarily be anything 'fixed' in the algorithm. It doesn't mean it becomes much simpler, though.

#9 fisicx


    Sonic Boom Member

  • Hall Of Fame
  • 1976 posts

Posted 29 May 2006 - 06:15 AM

It's not the words it's the meaning.

Search for angles. How does the SE know what sort of angle I'm looking for: the ancient peoples of Europe, a mathematical construct, a new approach to a problem? Or did they mean angels?

We know because we can read the context; the SE has to second-guess our intention. For a SE to incorporate LSI it needs to understand the query. This would mean us asking a question: 'what is the name given to an angle that is less than 90 degrees'. But we don't want to do this. We bash away using a range of keywords in the hope that one of the results will be useful to us.

If of course the SE could track our investigation it could begin to work out exactly what we are looking for. But this would mean storing everybody's search criteria so that it could over time begin to realize that somebody is a doctor, a scientist, a musician. Only then can LSI begin to work. IMHO.

Incidentally, as a teacher I get students whose first language is not English. Simple statements can be misinterpreted because the literal meaning differs from the idiomatic meaning.

#10 Guest_joedolson_*

  • Guests

Posted 29 May 2006 - 06:23 AM

I'm not sure that's significantly different from how language is communicated in society - perhaps the association of car and bottles begins gradually, appearing occasionally in media sources or in advertising. Over the course of 10 years, it becomes a standard association. From a search perspective, this may be a semantic association which is very unimportant at first - and that's exactly what it should be, because the association of car and bottle is a very weak semantic link if it's purely an association due to a few dozen media sources, etc.

As the language continues to develop and the terms are used more and more in conjunction, the semantic engine would learn that they are more importantly matched, and may begin to make that association.

I don't think that this is the chief problem for semantic indexing; I think a bigger problem may be coping with multiple meanings. The terms themselves are insignificant - how does the engine know that I MEAN to be looking for ancient mayan symbology when I search for "jaguar", instead of a car or a cat?

And, like fisicx says, literal meaning is very difficult to manage within the context of idiomatic speech - when I say in my blog that I'm off to "hit the hay" a human will easily identify that idiom - but an algorithm may have difficulty associating "hit hay" with going to bed.

If of course the SE could track our investigation it could begin to work out exactly what we are looking for. But this would mean storing everybody's search criteria so that it could over time begin to realize that somebody is a doctor, a scientist, a musician. Only then can LSI begin to work. IMHO.

Ultimately, do I want my searches to be incredibly effective because the search engine knows everything about my life and interests? Not really...I'd rather have privacy and have to work a bit harder to find the information I need! Even with all this collection of information, it would be very difficult for the search engine to manage ALL the varied interests and possible curious questions somebody would have over the course of their life.

#11 Black_Knight


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9417 posts

Posted 29 May 2006 - 09:06 AM

We have carried out, and keep updating, our extensive research on this latest search engine technology. If any expert from the search engine industry wants to share or discuss his/her expertise and experience with LSI, they are most welcome.

I hate to rain on a parade for 'the latest search engine technology', but we first covered this subject a few years ago. In fact, within a few months of first opening these forums in 2002.

In 2003, we were already directly applying advice regarding LSA to sites and techniques being discussed.
http://www.cre8asite...latent semantic

If LSI/LSA were really the latest technology then those hundreds of Information Retrieval scientists haven't been doing much in the years since. :) Seriously though, many of the most fundamental published papers on LSA were published in the mid nineties, and I've seen a lot of papers with publication dates from '94 to '96.

LSA is far older than many more significant updates to technology, including the infamous 'Florida' update which itself now seems quite ancient history.

Far more recently (a mere 2+ years ago, in January 2004) there was mention of a specific use of Semantic Analysis by a specific engine in the following discussion.

A good place to start with studying LSA would be the engines themselves.

#12 Spencer Hoyt

Spencer Hoyt

    Unlurked Energy

  • Members
  • 6 posts

Posted 30 May 2006 - 02:02 PM

I can't believe that you guys are not using LSI!!!
It is one of my secret weapons for competitive terms. If you would like to learn more on LSI Google Michael Marshall. He is the SEO who basically invented LSI and SEO.
Good Luck. :applause:

#13 Guest_joedolson_*

  • Guests

Posted 30 May 2006 - 02:11 PM

I can't believe that you guys are not using LSI!!!
It is one of my secret weapons for competitive terms. If you would like to learn more on LSI Google Michael Marshall. He is the SEO who basically invented LSI and SEO.

I'm not sure exactly what you mean by "using" latent semantic indexing. Latent semantic indexing is a technique for building an algorithm that can identify the essential meaning of your query or website and use that knowledge to make connections despite the lack of a direct key-term relationship. (To attempt to describe it, however imprecisely.)

If you're talking about the same Michael Marshall I know of, then he has written some interesting articles about writing content with LSI in mind.

However, as interesting as this article is, Michael Marshall certainly did not actually "invent" LSI or SEO.

I'd be interested in hearing how LSI has explicitly aided your SEO campaigns - how have you applied the concepts of LSI to your content and what demonstrates that this has helped you?

#14 BillSlawski


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15667 posts

Posted 30 May 2006 - 03:28 PM

It is one of the latest and most crucial technologies for search engines. It helps a search engine retrieve data from web sites and present results using LSI-based technology.

The concept of Indexing by Latent Semantic Analysis (pdf) was probably introduced in 1990, though it relies on a lot of research from the 1960s, 70s, and 80s (see the list of citations in the paper I linked to.)

The biggest public plunge that Google has probably made concerning any type of semantic indexing was their purchase of the company Applied Semantics, with its Conceptual Information Retrieval and Communication Architecture (CIRCA) technology. A number of similar ideas surfaced in two recent patent applications from Google:

Phrase-based searching in an information retrieval system

Multiple index based information retrieval system

Think about Google's supplemental index while looking at this one above. :)

See also the Applied Semantics Patents:

Meaning-based information organization and retrieval

Meaning-based advertising and document relevance determination

#15 Black_Knight


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9417 posts

Posted 30 May 2006 - 05:46 PM

It is always difficult to separate Semantics from Latent Semantics. However, there is of course a big difference in implication (and in spam-proofing).

The work of Applied Semantics was largely concerned with non-latent semantics. It was about contextual semantic clues usually within the same body of text. Latent Semantics is not about the words appearing in the document, but rather about those latent clues not spoken/written consciously.

When Bill links to a paper, even one with a non-obvious title, we can gather a context to this from the very fact that Bill linked to it, and in what context, before we ever even think of opening that document itself.

Likewise, things can be presented to look like something that they really are not. An example of this might well be some press releases. Just because the document presents itself as news, and uses the language of a news bulletin, does not in itself actually make it news. Where the press release appears, and how much attention it gets (could be measured in links or in viewing activity) does far more to identify real news than the language in the document ever will.

A non-event press release created just for a link that nobody reads may appear through active semantics to be the real news it is not.

Conversely an item on the CNN website might not appear to look like news at all, but really is just from where it is located.

Latent Semantics is about unstated context, and unspoken/unwritten clues and cues. There is some crossover between the two things, quite naturally. But latent semantics is far harder to fake, and thus is of a far more sturdy and robust use for search engines.

#16 A.N.Onym


    Honored One Who Served Moderator Alumni

  • Invited Users For Labs
  • 4003 posts

Posted 30 May 2006 - 06:28 PM

Thank you for the explanation, BK.
Don't think the Wikipedia entry mentioned anything of that.

Well, it'll be hard for the search engines to figure out what to take into account.
I expect a lot of errors here, given that they are trying to describe the indescribable (or at least the less obvious).

If I were a search engine, I'd make sure I know my simple semantics well enough before jumping into something serious.

Btw, Google Co-op may also help fight spam by supplying samples of trusted sites, so perhaps they aren't going to spend most of their resources on LS. However, this field is really tempting for the search engines.

#17 web based expertise

web based expertise

    Unlurked Energy

  • Members
  • 3 posts

Posted 12 June 2006 - 08:28 AM

Hi, :)

Some quick facts about LSI:

1. LSI is 30% more effective than the popular word-matching method.
2. LSI uses a powerful and fully automatic statistical method (Singular Value Decomposition).
3. It is very effective in cross-language retrieval.
4. LSI can retrieve relevant information that does not contain the query words.
5. It finds more relevant information than other methods.

Here are some examples of LSI-based SEO and non-LSI-based SEO. Please look at them and analyze the difference.

Result on Google for Query “Laptop” without LSI
(Kindly analyze the result on Top 10 pages format of Google)


Result on Google for Query “~Laptop” with LSI

Result on Google for Query “Mobile” without LSI
http://www.google.co...le &btnG=Search

Result on Google for Query “~Mobile” with LSI


We have written a 37 e-book on LSI. It includes some facts about LSI.

Kindly let me know if you would enjoy reading it. I will mail it by PM.

Google is implementing it in its semantic results. :cheers:

We are making our research more extensive with the latest facts and figures.



Edited by web based expertise, 12 June 2006 - 08:42 AM.

#18 Guest_joedolson_*

  • Guests

Posted 12 June 2006 - 08:43 AM

Unless I've completely misunderstood, the tilde operator in Google has practically nothing to do with latent semantics - the tilde operator simply cross-indexes your search with searches on synonyms of the inputted search terms and applies stemming to identify alternate forms of the term searched. At best, this could be called inherent or explicit semantics - structured meaning which is related explicitly within the strict definition of the term searched.

Latent semantics would involve a more sophisticated analysis of the context of your search - particularly applicable in multiple term searches where the terms could be determined to have particular relationships.

Furthermore, latent semantics would require an examination of the overall context of the pages returned, which, to the best of my knowledge, Google's operator does not perform.

#19 Black_Knight


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9417 posts

Posted 12 June 2006 - 09:22 AM

Using "~keyword" in a search has nothing whatever to do with Latent Semantic Analysis. That's standard semantics, such as applied semantics, with nothing latent about it.

The ~ operator is simply used to find synonyms and stem-variants of a given word.

Please tell me that your 37 ebooks on LSA do indeed know what LSA is, and what it is not.

This thread from the Search Engine Watch Forums discusses c-index methods for building up 'thesaurus' of related words by co-occurrence. As Orion states there, c-indexes date back to the 70s at least, and so far predate any existing search engine.

However, these too are mainly about 'active' semantics, not 'latent' semantics. Latent semantics are the things you only tend to spot from a distance, with fractals or some other plotting/graphing analysis.

The best way of explaining the difference might be to look at ways of blocking spam (such as emails).

If we use exact negative matching (allow no email that uses the word 'viagra' for example) this can get a wide range of false positives, blocking emails that were not actually spam at all.

Using semantics, we might look for variants of the word viagra, especially misspellings that are attempting to bypass the first kind of exact matching we just mentioned - like those that use "VlAGRA" (the I there is actually a lower-case L)

Latent semantics is the system that noted you could block over 70% of all spam by just blocking anything that used bold red exclamation points as punctuation. :)

Bayesian spam filtering methods are far more along the line of latent semantics than most other systems.

#20 yannis


    Sonic Boom Member

  • 1000 Post Club
  • 1634 posts

Posted 12 June 2006 - 09:34 AM

If you click here web based expertise you will get to learn the importance of anchor text! What you are pointing to is actually irrelevant and misleading. Bragging about your SEO skills and your e-book on LSI technology leaves me very cold and is very unprofessional, at least in this forum. You have actually contributed nothing to this thread other than telling us how good you are!

The proof is in the padding! Here is a challenge. Post a before and after 'LSI Technology' page and let the pros here comment on the techniques employed. Please spell check the page before you post! You prove your techniques with the relevant mathematics and I will eat my words and my summer hat!

#21 Guest_orion_*

  • Guests

Posted 10 July 2006 - 12:45 PM

Hi, there.

This might shed some light on the subject, on the meaning of the terms "latent" and "semantic structure", and on what one can get from LSI in general: Demystifying LSA, LSI, SVD, PCA, AND IS acronisms

Hope this helps. Sorry I cannot follow the discussion. I'm too busy.

Edited by orion, 10 July 2006 - 12:47 PM.

#22 SEOEgghead


    Whirl Wind Member

  • Members
  • 50 posts

Posted 11 July 2006 - 02:01 PM

If there's nothing "latent" about ~, why would "~hut" bring up Dominos Pizza?

I'm not an expert here, but it doesn't seem like a direct synonym, and they certainly do not have the same stem.

#23 Guest_orion_*

  • Guests

Posted 19 July 2006 - 02:04 PM

I'm working on and off on some tutorials on SVD (which is at the heart of LSI) and PCA (often mistaken for SVD). I'm writing them when I can find time, so it will take a bit of time to complete.

These will show stepwise how-to calculations so SEMs/SEOs can do the analysis without having to pay a dime to anyone (unless they want to). The tuts will get into the nitty-gritty of SVD and PCA and perhaps shed some light on the topic at hand.


#24 Black_Knight


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9417 posts

Posted 19 July 2006 - 03:42 PM

Off topic: Apparently Google are making extensive use of user data at the moment. The latest is that Google are actively watching the browsing habits of any user who has selected the Hebrew language, or uses http://www.google.co.il/ as their default search engine.

Yes, it is finally an application of Latent Semitic Analysis! :)

#25 Guest_orion_*

  • Guests

Posted 11 August 2006 - 02:53 PM

The Singular Value Decomposition and Latent Semantic Indexing Tutorial Series is now available at http://www.miislita.com

So far, only Parts 1 and 2 are available. The series is designed to provide the non-specialist (IR students and search engine marketers) with how-to calculation instructions and to debunk/demystify the many myths about SVD and LSI that many SEMs/SEOs hold.

SVD and LSI Tutorial 1: Understanding SVD and LSI

This tutorial introduces you to SVD and LSI. Includes:

1. Search Engine Marketers and their LSI Myths.
2. SVD/LSI Applications and Limitations.
3. A Geometrical Visualization of SVD.

SVD and LSI Tutorial 2: Computing Singular Values

This tutorial shows you how to compute singular values. Includes:

1. Matrix Transposition.
2. The Frobenius Norm.
3. Computing singular values and singular matrices.
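
For anyone who wants to try the first two items before opening the tutorial, here is a minimal sketch in plain Python. The 2x2 matrix is just an example picked for easy arithmetic, and the `transpose` and `frobenius_norm` helper names are mine, not the tutorial's.

```python
import math

def transpose(matrix):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(column) for column in zip(*matrix)]

def frobenius_norm(matrix):
    """Square root of the sum of all squared entries."""
    return math.sqrt(sum(value * value for row in matrix for value in row))

A = [[4.0, 0.0],
     [3.0, -5.0]]

print(transpose(A))       # [[4.0, 3.0], [0.0, -5.0]]
print(frobenius_norm(A))  # sqrt(16 + 0 + 9 + 25) = sqrt(50)
```

The reason the norm shows up in an SVD tutorial is that the squared Frobenius norm equals the sum of the squared singular values, which gives a cheap check on any singular values you compute by hand.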

Subsequent parts show how to compute the full SVD, give stepwise how-to calculations on how LSI scores and ranks documents, and cover new advances in the field. They will soon be out. I will eventually show you how to compute/play with LSI for your projects without having to pay a dime to anyone.

Enjoy it.

#26 Guest_orion_*

  • Guests

Posted 26 August 2006 - 05:01 PM

And here is my case against a portion of the SEO industry that are just LSI-based Snake Oil Marketers

Edited by orion, 26 August 2006 - 05:02 PM.

#27 Guest_orion_*

  • Guests

Posted 11 September 2006 - 12:01 PM

SVD and LSI Tutorial 3: Computing the Full SVD of a Matrix is available. I have summarized the SVD calculations in five easy steps. These include a handy shortcut to reduce computational overhead and are as follows:

1. Given a matrix, compute its transpose and the "right" matrix.

2. Determine the eigenvalues of the "right" matrix and sort these in descending order, in the absolute sense. Take the square roots of these to obtain the singular values.

3. Construct the diagonal matrix (S) by placing the singular values in descending order along its diagonal. Compute its inverse.

4. Use the ordered eigenvalues from step 2 and compute the eigenvectors of the "right" matrix. Place these eigenvectors along the columns of a new matrix (V) and compute its transpose.

5. Compute U using the shortcut described in the tutorial. To complete the proof, reconstruct the original matrix by computing its full SVD.
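
The five steps can be sketched in plain Python for a 2x2 matrix. This is only a sketch under stated assumptions: I am reading the "right" matrix as A^T A and the shortcut as U = A V S^-1 (which follows from A = U S V^T), neither of which is quoted from the tutorial. The closed-form eigenvalue formula only works for a 2x2 symmetric matrix, and the code assumes two distinct, non-zero singular values. The function names are mine.

```python
import math

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    """Swap rows and columns."""
    return [list(col) for col in zip(*X)]

def svd_2x2(A):
    """SVD of a 2x2 matrix, assuming distinct non-zero singular values."""
    # Step 1: transpose and "right" matrix (assumed here to be A^T A).
    right = matmul(transpose(A), A)
    a, b, d = right[0][0], right[0][1], right[1][1]
    if b == 0:
        raise ValueError("this sketch assumes a non-diagonal A^T A")
    # Step 2: eigenvalues of the symmetric 2x2 matrix, largest first;
    # their square roots are the singular values.
    root = math.sqrt((a - d) ** 2 + 4 * b * b)
    eigenvalues = [(a + d + root) / 2, (a + d - root) / 2]
    singular = [math.sqrt(ev) for ev in eigenvalues]
    # Step 3: diagonal matrix S and its inverse.
    S = [[singular[0], 0.0], [0.0, singular[1]]]
    S_inv = [[1 / singular[0], 0.0], [0.0, 1 / singular[1]]]
    # Step 4: unit eigenvectors (b, ev - a), placed along the columns of V.
    columns = []
    for ev in eigenvalues:
        x, y = b, ev - a
        length = math.hypot(x, y)
        columns.append([x / length, y / length])
    V = transpose(columns)
    # Step 5: assumed shortcut U = A V S^-1.
    U = matmul(matmul(A, V), S_inv)
    return U, S, V

A = [[4.0, 0.0], [3.0, -5.0]]
U, S, V = svd_2x2(A)
rebuilt = matmul(matmul(U, S), transpose(V))  # should reproduce A
print([round(s, 4) for s in (S[0][0], S[1][1])])  # [6.3246, 3.1623]
```

For anything larger than 2x2 the eigenvalues need a numerical routine, which is exactly what the online calculators do for you.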

These steps are easier to grasp in the tutorial since visual aids are provided.

I am working on Part 4, which describes how SVD is used in LSI. I am working also on a fast track, when I can find time. In this way readers can use the fast track as a quick reference, rather than reading the entire series. I am pondering an additional part (Part 5). This would include how-to instructions to crunch LSI data using software, so anyone can play with it and test things.

Part 4 includes a working example that explains how LSI scores documents and queries. This is quite a straightforward procedure. The important thing is that in the process many search marketing misconceptions are exposed and myths are dispelled.

Here is one such myth: that LSI is document indexing. At least in the SEO industry, this myth is partly the result of naming the technique with the "indexing" token, and partly the result of search marketers resorting to imagination in order to market their services. Why, then, do IRs use "indexing" when referring to the technique? Wait and see.

Time to demystify more SEO myths and "LSI based" Snake Oil Marketers.

#28 Guest_orion_*

  • Guests

Posted 20 September 2006 - 12:49 PM

Here is Part 4 of the SVD and LSI Tutorial series.


Note how straightforward LSI is.

Now anyone can compute LSI scores (or at least understand the basic calculations) with nothing more than an online matrix calculator. I have included some basic procedures.

This should help SEOs get more myths and misconceptions about LSI out of their heads.

Enjoy it and stay away from LSI snake oil sellers.

Dr. E. Garcia

#29 BillSlawski


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15667 posts

Posted 20 September 2006 - 01:13 PM

Well this was a pretty basic tutorial on LSI.


Thanks, Dr. Garcia. Your efforts towards explaining these concepts, and attempting to put them in easier-to-understand language, are appreciated very much. I think that these topics are valuable and worth learning by people who engage in search marketing.

But I also think that there's a fairly steep learning curve when it comes to these topics for most folks who don't have the educational background. :(

There are parts of your tutorials that are very clear and easy to understand, but other parts that require a baseline of knowledge that many might not have. I know that you have some of that baseline material on your site, like information about arrays. In one of your previous sections from this tutorial, you mention those at the top of the tutorial. I wonder if it would be helpful to include a very short section with links and information about those more elementary sections at the top of this page.

#30 Guest_orion_*

  • Guests

Posted 20 September 2006 - 02:14 PM

Well. This is a tutorial series. The target audience is IR students, researchers and search marketers new to IR and interested in LSI. So I assume that before reading Part 4 they have assimilated Parts 1, 2 and 3.

In Part 1 I have placed in big bold red text the warning and links you refer to. However, your suggestion has merit. I am repeating the same warning and links.

Even so, I know some readers will try to cut corners, only to find out they need to go back and visit those links.


Dr. E. Garcia

#31 BillSlawski


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15667 posts

Posted 20 September 2006 - 03:02 PM

Those are good points.

With many multiple part articles, I like to note at the top something like - "This is part three of a series. The other parts are: xxxx xxxx, xxxx xxx, and xxxx" - with links to them.

That's all that I mean there. :D

Some of the math and the arrays can be intimidating without reading your excellent introduction to arrays.

#32 Guest_orion_*

  • Guests

Posted 20 September 2006 - 04:24 PM

Good point.

I take that approach in several series of articles (the Term Vector, Fractals and Keyword Co-Occurrence series), but thought it was not necessary for the tutorials, since the titles and links between documents of the SVD and LSI series already state "Part 1", "Part 2", and so on.

The simplest way to look at matrix arrays is by thinking of these as mere tables, as mentioned in the matrix tutorial series. This takes away the false perception of "God, matrix arrays are so hard." These are mere tables, where columns (or rows) are viewed as vectors.


Dr. E. Garcia

#33 Mano70


    Mach 1 Member

  • Members
  • 256 posts

Posted 20 September 2006 - 04:31 PM

I must say as bragadocchio, thanks for putting this online.

I'm neither an IR student, a researcher nor a search marketer, but I still find these articles very helpful. Not sure I understand everything (and I don't have to since this isn't my work field), but sometimes it's just enough to get the big picture, and I get that also from your articles. Have referenced some of the articles when the LSI SEO question came up (on some other forums).

Edited by Mano70, 20 September 2006 - 04:32 PM.

#34 BillSlawski


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 15667 posts

Posted 20 September 2006 - 04:44 PM

I must say as bragadocchio, thanks for putting this online.

Just to avoid any confusion from others who read this thread, in case there is any, the tutorials are Dr. Garcia's.

And yes, they are very helpful. I've been getting a lot out of them myself. And yes, there are parts that I'm not sure that I understand completely either, but I think it's worth trying. :D

#35 Guest_orion_*

  • Guests

Posted 21 September 2006 - 09:50 AM


Unfortunately, to replicate the LSI how-to calculations one needs to know basic linear algebra operations. Without this background some parts might be hard to follow, as Bill pointed out. I could have resorted to a blackbox approach, but that would dilute the main goals of the series.

Here is a blackbox approach you could try.

Warning: Term Count Model Ahead, so this is not how LSI is used these days, but can give you some ideas on how things are done.

1. Query a search engine and collect top 3 ranked titles. Treat these as "documents", so your "collection" consists of 3 documents. (D=3)

2. Remove all stopwords and punctuation. Lowercase results. Ignore stemming.

3. Construct a term-document matrix from the surviving terms.

4. Use the Bluebit Calculator or any calculator that does SVD. Copy/paste the matrix in the calculator. (Check last tutorial for setting options). Submit.

5. You should see three matrices. U, S and V. Keep the first 2 columns of U and V and first 2 columns and rows of S. Compute query-document cosine similarity values and sort results in descending order. Check tutorials for details.
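
Steps 2 and 3 above can be sketched in plain Python; the SVD in step 4 is still left to the calculator. The stopword list and the three titles are placeholder examples rather than real search results, and the `tokenize` and `term_document_matrix` names are hypothetical helpers for this post.

```python
import string

# Hypothetical stopword list and sample titles, for illustration only.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def tokenize(title):
    """Lowercase, strip punctuation, drop stopwords; no stemming."""
    cleaned = title.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOPWORDS]

def term_document_matrix(titles):
    """Rows are terms (sorted), columns are documents, cells are raw term counts."""
    tokenized = [tokenize(title) for title in titles]
    terms = sorted({word for doc in tokenized for word in doc})
    matrix = [[doc.count(term) for doc in tokenized] for term in terms]
    return terms, matrix

titles = [
    "Shipment of gold damaged in a fire",
    "Delivery of silver arrived in a silver truck",
    "Shipment of gold arrived in a truck",
]

terms, matrix = term_document_matrix(titles)
for term, row in zip(terms, matrix):
    print(f"{term:10s}", *row)  # one row per term, ready to paste into a calculator
```

Because the cells are raw counts, this is the Term Count Model the warning above refers to, with all of its limitations.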

If you don't know matrix operations, do this:

1. Visit a site that has matrix calculators and does matrix multiplication. There are plenty around.

2. Copy and paste the generated query matrices to multiply these and come up with query vector coordinates. Check tutorial to identify these matrices.

3. The rows of V give you document vector coordinates. Compute query-document cosine similarity values and sort results in descending order.
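
The last step, cosine similarity followed by a descending sort, might look like this in plain Python. The coordinates below are made-up numbers standing in for the rows of a truncated V and for a query vector in the same two-dimensional space; they are not output from any real calculation, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def rank_documents(query_vector, document_vectors):
    """Return (document id, similarity) pairs, best match first."""
    scores = {doc_id: cosine(query_vector, vec) for doc_id, vec in document_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up coordinates: each document vector stands for one row of the truncated V.
document_vectors = {"d1": [-0.49, 0.65], "d2": [-0.65, -0.72], "d3": [-0.58, 0.25]}
query_vector = [-0.21, -0.18]

ranking = rank_documents(query_vector, document_vectors)
for doc_id, score in ranking:
    print(doc_id, round(score, 4))
```

With these sample numbers the query points in nearly the same direction as d2, so d2 comes out on top; only the direction of the vectors matters, not their length.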

Note this procedure uses the Term Count Model and inherits many theoretical limitations, but at least gives you some ideas on how things work.

Hope this helps.

Dr. E. Garcia

Edited by orion, 21 September 2006 - 09:57 AM.

#36 Guest_orion_*

  • Guests

Posted 22 September 2006 - 09:03 AM

The quick reference for the series on SVD and LSI, the LSI Fast Track Tutorial, is now available at


Note that there are no "magic words" in LSI.

Note also how term vector theory is still used, at the beginning and at the end of the SVD decomposition.

Dr. E. Garcia

#37 Guest_orion_*

  • Guests

Posted 03 October 2006 - 12:50 PM

Here is Mike Grehan's recent ClickZ column Lies, Lies, and LSI. In that article Mike and Randfish take positions on the whole issue of certain SEOs trying to market LSI services. Here is my response to a few comments made by Randfish.

#38 Guest_orion_*

  • Guests

Posted 19 October 2006 - 08:25 AM

Finally, here is the last article of the tutorial series on SVD and LSI:

LSI Keyword Research and Co-Occurrence Theory

In this LSI tutorial readers will learn how to cluster keywords in a k-dimensional reduced space. They will also learn how first- and second-order co-occurrence affects LSI scores. This should demystify the so-called "LSI tools", most of which are based on permutation and synonym lookups, not on LSI at all. Some merely use plain term vectors, and others simply fake the results. None of these use SVD, so by definition they are not using LSI.

Here is a good tip. When we cluster terms using LSI, the terms must be in the initial term-document matrix. So, whatever the results, these will only be valid for the tested universe. An "LSI tool" that reports terms not present in the original universe is simply appending these from an external source (e.g., by means of a word list lookup) and is therefore faking the results.
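That tip can be turned into a quick sanity check. The sketch below is only an illustration: the function name and the sample term lists are hypothetical. It flags any term a tool reports that was never a row of the initial term-document matrix.

```python
def terms_outside_universe(matrix_terms, reported_terms):
    # Terms that index the rows of the initial term-document matrix.
    universe = {t.lower() for t in matrix_terms}
    # Anything reported but absent from the matrix cannot have come from LSI.
    return [t for t in reported_terms if t.lower() not in universe]

# Hypothetical example: "bullion" was never in the tested universe.
print(terms_outside_universe(["gold", "silver", "truck"],
                             ["gold", "bullion", "Truck"]))
```

An empty list does not prove that a tool uses SVD, but a non-empty one shows that at least some of its output came from somewhere other than the matrix.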

So far I have not seen any valid LSI tool from any search marketing firm. My feeling is that some that have bought one have been "taken".

Also available:

LSI Keyword Research - A Fast Track Tutorial

Both pieces are designed to demystify how LSI clusters keywords. We also debunk a piece of bad SEO advice: The Synonym Myth. According to this myth, to make documents "LSI-friendly" certain SEOs are advising the stuffing of documents with synonyms and related terms.

This is bad advice for two reasons:

1. It shows a lack of understanding of how LSI groups terms.

2. LSI clustering power is not due to the nature of the terms, but is a direct consequence of a co-occurrence phenomenon. Terms do not have to be synonyms to be clustered.

However, let me say this. The use of synonyms and related terms in documents is a sound technique that professional writers have used for centuries, and it is recommended. But one should not stuff documents with these just because back in 1988 Dumais and others applied the 1965 Golub and Kahan SVD algorithm to a vocabulary problem and called that LSI (or LSA if you wish). Doing so in an arbitrary manner demonstrates a lack of understanding of latent semantic indexing theory.

Once again, stay away from SEO firms that promote so-called "LSI Tools". Do not let them game you.


Dr. E. Garcia

Edited by orion, 19 October 2006 - 08:34 AM.

#39 Black_Knight


    Honored One Who Served Moderator Alumni

  • Hall Of Fame
  • 9417 posts

Posted 19 October 2006 - 01:14 PM

A lot of the support for using synonyms has nothing to do with LSI, as you note, but is still advised for 'long tail' search terms. People use different words.

There is current interest in studying how words correlate and co-occur with the aim of improving Information Retrieval systems, though not as LSI, as you note. I just mention that in case some webmasters newer to the field mistake your assertion that these have nothing to do with LSI as meaning that they have nothing to do with IR, or SEO. :)

#40 Guest_orion_*

  • Guests

Posted 20 October 2006 - 09:19 AM

Indeed, a lot of the use of synonyms and related terms in copy has nothing to do with LSI.

At this DigitalPoint thread I explained that the use of synonyms and related terms is a common sense practice one should use to improve copy style, but not one that should be adopted because of LSI.

There is no such thing as "LSI-friendly" documents

Some SEOs are giving the wrong advice by saying that one should use synonyms and related terms under the pretense or wrong thesis that this will make a document "LSI-friendly". In fact, when one thinks it through, there is no such thing as making documents "LSI-friendly". This is another SEO Myth.

The great thing about a phenomenon taking place at a global level like co-occurrence and IDF (inverse document frequency) is that the chances for end users to manipulate these are close to nada, zero, zip, nothing.

In LSI, co-occurrence (especially second-order co-occurrence) is responsible for the LSI scores assigned to terms, not the nature of the terms themselves or whether these are synonyms or related terms. In the early LSI papers this was not fully addressed and emphasis was given to synonyms. Why?

Because the documents selected to conduct those experiments happened to contain synonyms and related terms. It was thought that synonymy was responsible for the clustering phenomenon. The fact is that this was a direct result of co-occurrence patterns present in the LSI matrix. In recent years several papers have been published on the subject:

Understanding LSI via the Truncated Term-term Matrix, 2005 Thesis, by Regis Newo (Germany)

A Framework for Understanding Latent Semantic Indexing (LSI) Performance, April Kontostathis and William Pottenger (Lehigh University).

Pottenger and Kontostathis have published a series of papers on the subject.

These two studies explain the role of co-occurrence patterns in the LSI matrix, but differ a bit in some of their findings.

SEOs are still quoting the first LSI papers from the late eighties and early nineties and in the process some have stretched that old research in order to market better whatever they sell.

The following figure from the last tutorial shows that LSI clusters terms, not because these are synonyms, but because of first- and second-order co-occurrence paths present in the term-document matrix, as can be seen from the corresponding eigenvectors and term vectors.

Posted Image

Certainly in this term-document example taken from the Grossman and Frieder IR textbook (note: the data is theirs, but the graph and calculations are mine) none of the terms are synonyms. Still, LSI was able to cluster terms.
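To make first- and second-order co-occurrence concrete, here is a small sketch in Python with NumPy. It assumes one common reading of the terms: two terms co-occur at first order when they share at least one document, and at second order when they never share a document but both co-occur with some third term. The function name and the toy matrix are hypothetical, and this is not the calculation behind the figure.

```python
import numpy as np

def cooccurrence_orders(A):
    """A is a term-document count matrix (terms x documents)."""
    present = (A > 0).astype(int)
    shared = present @ present.T       # number of documents each term pair shares
    np.fill_diagonal(shared, 0)        # ignore a term paired with itself
    first = shared > 0                 # first-order: appear together somewhere
    adj = first.astype(int)
    paths2 = adj @ adj                 # number of two-step paths term -> term -> term
    np.fill_diagonal(paths2, 0)
    second = (paths2 > 0) & ~first     # linked only through an intermediate term
    return first, second

# Toy universe: rows are gold, shipment, truck; columns are two documents.
terms = ["gold", "shipment", "truck"]
A = np.array([[1, 0],
              [1, 1],
              [0, 1]])
first, second = cooccurrence_orders(A)
print(first[0, 1], first[0, 2], second[0, 2])
```

Here "gold" and "truck" never appear in the same document, yet they are connected through "shipment". None of the three are synonyms.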

When LSI is applied to a term-document matrix representing a collection of documents in the zillions, the co-occurrence phenomenon that affects the LSI scores becomes a global effect, occurring between documents in the collection.

Thus, the only way that end users (e.g. SEOs) could influence the LSI scores is if they can access and control the content of all the documents in the matrix or launch a coordinated spam attack on the entire collection. The latter would be the case of a spammer trying to make an LSI-based search engine index billions of documents (to name a quantity) that he/she has created.

If an end user or researcher wants to understand and manipulate the effect of co-occurrence in a single document, he/she would need to deconstruct that document, make a term-passage matrix for it, apply LSI to this matrix, and then play by manipulating single terms. Whatever the results, these will only be valid for the universe represented by the matrix, that is, for that and only that document.

If such a document is then submitted to the LSI-based search engine, that local effect simply vanishes and global co-occurrence "takes control" and spreads throughout the collection, forming the corresponding connectivity paths that eventually force a redistribution of term weights.
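For anyone who wants to try the single-document experiment, a term-passage matrix could be built along these lines. This is a rough sketch in Python with NumPy under loud assumptions: passages are taken to be sentences split on ".", "!" and "?", no stopwords are removed, and the function name and sample text are hypothetical.

```python
import re
import numpy as np

def term_passage_matrix(document):
    # Assumption: a "passage" is a sentence; other splits are equally valid.
    passages = [p.strip() for p in re.split(r"[.!?]+", document) if p.strip()]
    tokens = [re.findall(r"[a-z0-9]+", p.lower()) for p in passages]
    vocab = sorted({w for t in tokens for w in t})
    row = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(passages)))
    for j, t in enumerate(tokens):
        for w in t:
            A[row[w], j] += 1   # raw term counts per passage
    return A, vocab, passages

A, vocab, passages = term_passage_matrix("Gold shipment arrived. Silver truck arrived!")
print(vocab)
print(A)
```

The resulting matrix can be fed to any SVD routine, but whatever comes out describes that one document only.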

Consequently, SEOs that sell this idea of making documents "LSI-friendly", like some firms sending emails reading "is your site LSI optimized?" or "we can make your documents LSI-valid!", or those that promote the notion of "LSI and Link Popularity", end up exposed for what they are and for how much they know about search engines. The sad thing is that these find their way via search engine conferences (SES), blogs and forums to deceive the industry with such blogonomies. BTW here are Two More LSI Blogonomies.

Dr. E. Garcia

Edited by orion, 05 December 2006 - 12:03 PM.
