> Latest Search Engine Technology LSI (Latent Semantic Indexing), An extensive discussion over LSI (Latent Semantic Indexing)

Member

Group: Members
Joined: 30-June 05
Posts: 38
post Jul 10 2006, 12:45 PM
Hi, there.

This might shed some light on the subject, on the meaning of the terms "latent" and "semantic structure", and on what one can get from LSI in general: Demystifying LSA, LSI, SVD, PCA, AND IS acronisms

Hope this helps. Sorry I cannot follow the discussion. I'm too busy.

This post has been edited by orion: Jul 10 2006, 12:47 PM

Solid Contributor

Group: Members
Joined: 6-July 06
Posts: 50
From: New York
post Jul 11 2006, 02:01 PM
If there's nothing "latent" about ~, why would "~hut" bring up Dominos Pizza?

I'm not an expert here, but it doesn't seem like a direct synonym, and they certainly do not have the same stem.


Member

Group: Members
Joined: 30-June 05
Posts: 38
post Jul 19 2006, 02:04 PM
I'm working on and off on some tutorials on SVD (which is at the heart of LSI) and PCA (often mistaken for SVD). I'm writing them when I can find time, so it will take a bit of time to complete.

These will show stepwise how-to calculations so SEMs/SEOs can do the analysis without having to pay a dime to anyone (unless they want to). The tutorials will get into the nitty-gritty of SVD and PCA and perhaps shed some light on the topic at hand.

Cheers.

Moderator Alumni

Group: Hall Of Fame
Joined: 1-September 02
Posts: 9,213
From: UK
post Jul 19 2006, 03:42 PM
Off topic: Apparently Google are making extensive use of user data at the moment. The latest is that Google are actively watching the browsing habits of any user who has selected the Hebrew language, or uses http://www.google.co.il/ as their default search engine.

Yes, it is finally an application of Latent Semitic Analysis! :D

Member

Group: Members
Joined: 30-June 05
Posts: 38
post Aug 11 2006, 02:53 PM
The Singular Value Decomposition and Latent Semantic Indexing Tutorial Series is now available at http://www.miislita.com

So far, only Parts 1 and 2 are available. The series is designed to provide the non-specialist (IR students and search engine marketers) with how-to calculation instructions and to debunk the many myths about SVD and LSI that many SEMs/SEOs hold.

SVD and LSI Tutorial 1: Understanding SVD and LSI

This tutorial introduces you to SVD and LSI. Includes:

1. Search Engine Marketers and their LSI Myths.
2. SVD/LSI Applications and Limitations.
3. A Geometrical Visualization of SVD.


SVD and LSI Tutorial 2: Computing Singular Values

This tutorial shows you how to compute singular values. Includes:

1. Matrix Transposition.
2. The Frobenius Norm.
3. Computing singular values and singular matrices.


Subsequent parts show how to compute the full SVD, give stepwise how-to calculations on how LSI scores and ranks documents, and cover new advances in the field. They will be out soon. I will eventually show you how to compute and play with LSI for your projects without having to pay a dime to anyone.


Enjoy it.


Member

Group: Members
Joined: 30-June 05
Posts: 38
post Aug 26 2006, 05:01 PM
And here is my case against a portion of the SEO industry that are just LSI-based Snake Oil Marketers




This post has been edited by orion: Aug 26 2006, 05:02 PM

Member

Group: Members
Joined: 30-June 05
Posts: 38
post Sep 11 2006, 12:01 PM
SVD and LSI Tutorial 3: Computing the Full SVD of a Matrix is available. I have summarized the SVD calculations in five easy steps. These include a handy shortcut to reduce computational overhead and are as follows:


1. Given a matrix, compute its transpose and the "right" matrix.

2. Determine the eigenvalues of the "right" matrix and sort these in descending order, in the absolute sense. Take the square roots of these to obtain the singular values.

3. Construct the diagonal matrix (S) by placing the singular values in descending order along its diagonal. Compute its inverse.

4. Use the ordered eigenvalues from step 2 to compute the eigenvectors of the "right" matrix. Place these eigenvectors along the columns of a new matrix (V) and compute its transpose.

5. Compute U using the shortcut described in the tutorial. To complete the proof, reconstruct the original matrix from its full SVD.


These steps are easier to grasp in the tutorial since visual aids are provided.
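
For readers who prefer code, the five steps above can be sketched in NumPy. This is only a minimal sketch, not the tutorial's own procedure: it assumes the "right" matrix means A^T A and that the shortcut for U is U = A V S^-1 (neither is spelled out in this post), and the function name and the small example matrix are made up for illustration. It also assumes every singular value is nonzero, since S must be invertible.

```python
import numpy as np

def svd_by_eigen(A):
    """SVD of a matrix with full column rank via the eigenvectors of A^T A."""
    A = np.asarray(A, dtype=float)
    # Step 1: transpose and the "right" matrix (assumed here to mean A^T A).
    right = A.T @ A
    # Steps 2 and 4: eigenvalues and eigenvectors, sorted in descending order.
    eigvals, eigvecs = np.linalg.eigh(right)  # eigh is meant for symmetric matrices
    order = np.argsort(eigvals)[::-1]
    eigvals, V = eigvals[order], eigvecs[:, order]
    # Step 2 (continued): singular values are the square roots of the eigenvalues.
    singular_values = np.sqrt(np.clip(eigvals, 0.0, None))
    # Step 3: diagonal matrix S and its inverse (requires nonzero singular values).
    S = np.diag(singular_values)
    S_inv = np.linalg.inv(S)
    # Step 5: assumed shortcut, U = A V S^-1.
    U = A @ V @ S_inv
    return U, S, V.T

A = np.array([[4.0, 0.0], [3.0, -5.0]])  # illustrative matrix, not from the thread
U, S, Vt = svd_by_eigen(A)
reconstructed = U @ S @ Vt  # should reproduce A up to rounding error
```

Multiplying U, S and V^T back together is the "complete the proof" check from step 5.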

I am working on Part 4, which describes how SVD is used in LSI. I am working also on a fast track, when I can find time. In this way readers can use the fast track as a quick reference, rather than reading the entire series. I am pondering an additional part (Part 5). This would include how-to instructions to crunch LSI data using software, so anyone can play with it and test things.

Part 4 includes a working example that explains how LSI scores documents and queries. This is quite a straightforward procedure. The important thing is that in the process many search marketing misconceptions are exposed and myths are dispelled.

Here is one such myth: that LSI is document indexing. At least in the SEO industry, this myth is in part the result of naming the technique using the "indexing" token, and in part the result of search marketers resorting to imagination in order to market their services. Why then do IR researchers use "indexing" when referring to the technique? Wait and see.


Time to demystify more SEO myths and "LSI based" Snake Oil Marketers.

Member

Group: Members
Joined: 30-June 05
Posts: 38
post Sep 20 2006, 12:49 PM
Here is Part 4 of the SVD and LSI Tutorial series.

http://www.miislita.com/information-retrie...lculations.html

Note how straightforward LSI is.

Now anyone can compute LSI scores (or at least understand the basic calculations) with nothing but an online matrix calculator. I have included some basic procedures.

This should help SEOs get more myths and misconceptions regarding LSI out of their heads.

Enjoy it and stay away from LSI snake oil sellers.

Dr. E. Garcia

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Sep 20 2006, 01:13 PM
QUOTE(Dr. Garcia)
Well this was a pretty basic tutorial on LSI.


:)

Thanks, Dr. Garcia. Your efforts to explain these concepts, and to put them in easier-to-understand language, are very much appreciated. I think that these topics are valuable and worth learning by people who engage in search marketing.

But I also think that there's a fairly steep learning curve when it comes to these topics for most folks who don't have the educational background. :(

There are parts of your tutorials that are very clear and easy to understand, but other parts that require a baseline of knowledge that many might not have. I know that you have some of that baseline material on your site, like information about arrays. In one of your previous sections from this tutorial, you mention those at the top of the tutorial. I wonder if it would be helpful to include a very short section with links and information about those more elementary sections at the top of this page.

Member

Group: Members
Joined: 30-June 05
Posts: 38
post Sep 20 2006, 02:14 PM
Well. This is a tutorial series. The target audience is IR students, researchers and search marketers new to IR and interested in LSI. So I assume that before reading Part 4 they have assimilated Parts 1, 2 and 3.

In Part 1 I have placed in big bold red text the warning and links you refer to. However, your suggestion has merit. I am repeating the same warning and links.

Still, I know some readers will try to cut corners, only to find out they need to go back and visit those links.

Cheers

Dr. E. Garcia

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Sep 20 2006, 03:02 PM
Those are good points.

With many multiple part articles, I like to note at the top something like - "This is part three of a series. The other parts are: xxxx xxxx, xxxx xxx, and xxxx" - with links to them.

That's all that I mean there. :)

Some of the math and the arrays can be intimidating without reading your excellent introduction to arrays.

Member

Group: Members
Joined: 30-June 05
Posts: 38
post Sep 20 2006, 04:24 PM
Good point.

I use that approach in several series of articles (the Term Vector, Fractals and Keyword Co-Occurrence series), but thought it was not necessary for the tutorials, since the titles and the links between documents of the SVD and LSI series already state "Part 1", "Part 2", and so on.

The simplest way to look at matrix arrays is by thinking of these as mere tables, as mentioned in the matrix tutorial series. This takes away the false perception of "God, matrix arrays are so hard." These are mere tables, where columns (or rows) are viewed as vectors.
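
As a small illustration of the "mere tables" view, here is a sketch in Python with NumPy. The terms, document labels and counts are hypothetical and chosen only to show how a column or a row of the table is read as a vector.

```python
import numpy as np

terms = ["gold", "silver", "truck"]  # row labels (hypothetical)
docs = ["d1", "d2", "d3"]            # column labels (hypothetical)

# Each cell counts how often a term occurs in a document.
table = np.array([
    [1, 0, 1],  # gold
    [0, 2, 0],  # silver
    [0, 1, 1],  # truck
])

# A column of the table read as a document vector.
doc_vector_d2 = table[:, docs.index("d2")]
# A row of the table read as a term vector.
term_vector_gold = table[terms.index("gold"), :]
```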

Cheers

Dr. E. Garcia

Quarter Grand Poster

Group: Members
Joined: 18-November 05
Posts: 256
post Sep 20 2006, 04:31 PM
I must say as bragadocchio, thanks for putting this online.

I'm not an IR student, researcher or search marketer, but I still find these articles very helpful. I'm not sure I understand everything (and I don't have to, since this isn't my field of work), but sometimes it's enough just to get the big picture, and I get that from your articles too. I have referenced some of the articles when the LSI SEO question came up (on some other forums).

This post has been edited by Mano70: Sep 20 2006, 04:32 PM

Moderator Alumni

Group: Hall Of Fame
Joined: 31-August 02
Posts: 15,634
post Sep 20 2006, 04:44 PM
QUOTE
I must say as bragadocchio, thanks for putting this online.


Just to avoid any confusion for others who read this thread, in case there is any: the tutorials are Dr. Garcia's.

And yes, they are very helpful. I've been getting a lot out of them myself. And yes, there are parts that I'm not sure that I understand completely either, but I think it's worth trying. :)

Member

Group: Members
Joined: 30-June 05
Posts: 38
post Sep 21 2006, 09:50 AM
Thanks.

Unfortunately, to replicate the LSI how-to calculations one needs to know basic linear algebra operations. Without this background some parts might be hard to follow, as Bill pointed out. I could have resorted to a blackbox approach, but that would dilute the main goals of the series.

Here is a blackbox approach you could try.

Warning: Term Count Model Ahead, so this is not how LSI is used these days, but can give you some ideas on how things are done.

1. Query a search engine and collect top 3 ranked titles. Treat these as "documents", so your "collection" consists of 3 documents. (D=3)

2. Remove all stopwords and punctuation. Lowercase results. Ignore stemming.

3. Construct a term-document matrix with the surviving terms.

4. Use the Bluebit Calculator or any calculator that does SVD. Copy/paste the matrix in the calculator. (Check last tutorial for setting options). Submit.

5. You should see three matrices: U, S and V. Keep the first 2 columns of U and V and the first 2 columns and rows of S. Compute query-document cosine similarity values and sort results in descending order. Check the tutorials for details.

If you don't know matrix operations, do this:

1. Visit a site that has matrix calculators and does matrix multiplication. There are plenty around.

2. Copy and paste the generated query matrices to multiply these and come up with query vector coordinates. Check tutorial to identify these matrices.

3. The rows of V give you the document vector coordinates. Compute query-document cosine similarity values and sort results in descending order.

Note this procedure uses the Term Count Model and inherits many theoretical limitations, but at least gives you some ideas on how things work.
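
For those who would rather script it than use an online calculator, the blackbox procedure above can be sketched in Python with NumPy. Treat this as a rough sketch under stated assumptions, not the tutorial's method: the three titles, the tiny stopword list and the function names are made up for illustration, and the query is assumed to be mapped into the reduced space as q^T U_k S_k^-1 before taking cosines against the rows of V_k. Like the procedure itself, it uses the Term Count Model.

```python
import re
import numpy as np

STOPWORDS = {"a", "in", "of", "the"}  # tiny illustrative stopword list

def tokenize(text):
    """Lowercase, strip punctuation, drop stopwords; no stemming."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]

def lsi_rank(titles, query, k=2):
    """Return (document index, cosine similarity) pairs, best match first."""
    doc_tokens = [tokenize(t) for t in titles]
    vocab = sorted({w for tokens in doc_tokens for w in tokens})
    # Term-document matrix of raw counts (rows: terms, columns: documents).
    A = np.array([[tokens.count(term) for tokens in doc_tokens] for term in vocab],
                 dtype=float)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the first k columns of U and V and the k x k corner of S.
    Uk, Sk_inv, Vk = U[:, :k], np.diag(1.0 / s[:k]), Vt[:k, :].T
    # Query coordinates in the reduced space (assumed form: q^T U_k S_k^-1).
    q = np.array([tokenize(query).count(term) for term in vocab], dtype=float)
    qk = q @ Uk @ Sk_inv
    # Rows of V_k are the document coordinates; compare each with the query.
    sims = (Vk @ qk) / (np.linalg.norm(Vk, axis=1) * np.linalg.norm(qk))
    order = np.argsort(-sims)
    return [(int(i), float(sims[i])) for i in order]

# Hypothetical "top 3 titles", so the collection has D = 3 documents.
titles = [
    "Shipment of gold damaged in a fire",
    "Delivery of silver arrived in a silver truck",
    "Shipment of gold arrived in a truck",
]
ranking = lsi_rank(titles, "silver truck")
```

Here `ranking` lists the documents from most to least similar to the query. A query that shares no terms with the collection gives a zero query vector, so the cosine is undefined; the sketch does not guard against that.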

Hope this helps.

Dr. E. Garcia

This post has been edited by orion: Sep 21 2006, 09:57 AM

Member

Group: Members
Joined: 30-June 05
Posts: 38
post Sep 22 2006, 09:03 AM
The quick reference for the series on SVD and LSI, the LSI Fast Track Tutorial, is now available at

http://www.miislita.com/information-retrie...ck-tutorial.pdf

Note that there are no "magic words" in LSI.

Note also how term vector theory is still used, at the beginning and at the end of the SVD decomposition.

Dr. E. Garcia


Member

Group: Members
Joined: 30-June 05
Posts: 38
post Oct 3 2006, 12:50 PM
Here is Mike Grehan's recent ClickZ column, Lies, Lies, and LSI. In that article Mike and Randfish take positions on the whole issue of certain SEOs trying to market LSI services. Here is my response to a few comments made by Randfish.


Member

Group: Members
Joined: 30-June 05
Posts: 38
post Oct 19 2006, 08:25 AM
Finally, here is the last article of the tutorial series on SVD and LSI:

LSI Keyword Research and Co-Occurrence Theory

In this LSI tutorial readers will learn how to cluster keywords in a k-dimensional reduced space. They will also learn how first- and second-order co-occurrence affects LSI scores. This should demystify the so-called "LSI tools", most of which are based on permutation and synonym lookups, not on LSI at all. Some merely use plain term vectors, and others simply fake the results. None of these use SVD, so by definition they are not using LSI.

Here is a good tip. When we cluster terms using LSI, the terms must be in the initial term-document matrix. So, whatever the results, these will only be valid for the tested universe. An "LSI tool" that reports terms not present in the original universe is simply appending these from an external source (e.g., by means of a word list lookup) and is therefore faking the results.
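
A quick way to apply this tip is to compare the terms a tool reports against the vocabulary of the documents you fed it. The sketch below is a hypothetical check, not part of any tutorial: the function name and sample data are made up, and it assumes single-word terms and simple whitespace tokenization.

```python
def terms_outside_universe(reported_terms, documents):
    """Return reported terms that never occur in the tested documents."""
    vocabulary = {word for doc in documents for word in doc.lower().split()}
    return sorted(t for t in reported_terms if t.lower() not in vocabulary)

# Hypothetical tested universe and hypothetical tool output.
documents = [
    "shipment of gold damaged in a fire",
    "delivery of silver arrived in a silver truck",
]
reported = ["gold", "silver", "bullion", "jewelry"]
suspect_terms = terms_outside_universe(reported, documents)
```

Any term left in `suspect_terms` cannot have come from an SVD of the original term-document matrix, so it must have been appended from somewhere else.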

So far I have not seen any valid LSI tool from any search marketing firm. My feeling is that some who have bought one have been "taken".

Also available:

LSI Keyword Research - A Fast Track Tutorial

Both pieces are designed to demystify how LSI clusters keywords. We also debunk a bad piece of SEO advice: The Synonym Myth. According to this myth, documents should be stuffed with synonyms and related terms to make them "LSI-friendly", and certain SEOs are advising exactly that.

This is bad advice for two reasons:

1. It shows a lack of understanding of how LSI groups terms.

2. LSI's clustering power is not due to the nature of the terms, but is a direct consequence of a co-occurrence phenomenon. Terms do not have to be synonyms to be clustered.

However, let me say this. The use of synonyms and related terms in documents is a sound technique that professional writers have used for centuries, and it is recommended. But one should not stuff documents with these just because, back in 1988, Dumais and others applied the 1965 Golub and Kahan SVD algorithm to a vocabulary problem and called that LSI (or LSA if you wish). Doing so in an arbitrary manner demonstrates a lack of understanding of latent semantic indexing theory.

Once again, stay away from SEO firms that promote so-called "LSI Tools". Do not let them game you.

Cheers

Dr. E. Garcia

This post has been edited by orion: Oct 19 2006, 08:34 AM

Moderator Alumni

Group: Hall Of Fame
Joined: 1-September 02
Posts: 9,213
From: UK
post Oct 19 2006, 01:14 PM
A lot of the support for using synonyms has nothing to do with LSI, as you note, but is still advised for 'long tail' search terms. People use different words.

There is current interest in studying how words correlate and co-occur with the aim of improving Information Retrieval systems, though not as LSI, as you note. I just mention that in case some webmasters newer to the field mistake your assertion that these have nothing to do with LSI as meaning that they have nothing to do with IR, or SEO. :)

Member

Group: Members
Joined: 30-June 05
Posts: 38
post Oct 20 2006, 09:19 AM
Indeed, much of the use of synonyms and related terms in copy has nothing to do with LSI.

At this DigitalPoint thread I explained that the use of synonyms and related terms is a common-sense practice one should use to improve copy style, but not one to adopt because of LSI.

There is no such thing as "LSI-friendly" documents

Some SEOs are giving the wrong advice by saying that one should use synonyms and related terms under the pretense or wrong thesis that this will make a document "LSI-friendly". In fact, when one thinks it through, there is no such thing as making documents "LSI-friendly". This is another SEO myth.

The great thing about a phenomenon taking place at a global level like co-occurrence and IDF (inverse document frequency) is that the chances for end users to manipulate these are close to nada, zero, zip, nothing.

In LSI, co-occurrence (especially second-order co-occurrence) is responsible for the LSI scores assigned to terms, not the nature of the terms themselves or whether these are synonyms or related terms. In the early LSI papers this was not fully addressed and emphasis was given to synonyms. Why?

Because the documents selected to conduct those experiments happened to contain synonyms and related terms. It was thought that synonymy associations were responsible for the clustering phenomenon. The fact is that this was a direct result of the co-occurrence patterns present in the LSI matrix. In recent years several papers have been published on the subject:


Understanding LSI via the Truncated Term-term Matrix, 2005 Thesis, by Regis Newo (Germany)

A Framework for Understanding Latent Semantic Indexing (LSI) Performance, April Kontostathis and William Pottenger (Lehigh University).

Pottenger and Kontostathis have published a series of papers on the subject.

These two studies explain the role of co-occurrence patterns in the LSI matrix, but differ a bit in some of their findings.

SEOs are still quoting the first LSI papers from the late eighties and early nineties, and in the process some have stretched that old research in order to better market whatever they sell.


The following figure from the last tutorial shows that LSI clusters terms not because these are synonyms, but because of the first- and second-order co-occurrence paths present in the term-document matrix, as can be seen from the corresponding eigenvectors and term vectors.


IPB Image


Certainly in this term-document example taken from Grossman and Frieder's IR textbook (note: the data is theirs, but the graph and calculations are mine) none of the terms are synonyms. Still, LSI was able to cluster terms.
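
To make the first- and second-order idea concrete, here is a small NumPy sketch. It is an illustration under assumptions, not the calculation behind the figure: the term counts are toy values made up for this example, first-order co-occurrence is read off the term-term matrix A A^T, and term coordinates in the reduced space are taken as the rows of U_k S_k.

```python
import numpy as np

terms = ["arrived", "damaged", "delivery", "fire",
         "gold", "shipment", "silver", "truck"]
# Toy term-document matrix (rows: terms, columns: three documents).
A = np.array([
    [0, 1, 1],  # arrived
    [1, 0, 0],  # damaged
    [0, 1, 0],  # delivery
    [1, 0, 0],  # fire
    [1, 0, 1],  # gold
    [1, 0, 1],  # shipment
    [0, 2, 0],  # silver
    [0, 1, 1],  # truck
], dtype=float)

i, j = terms.index("fire"), terms.index("truck")

# First-order co-occurrence: entry (i, j) of A A^T is nonzero only when
# the two terms appear together in at least one document.
first_order = A @ A.T

# Second-order co-occurrence: the two terms share no document, but both
# co-occur with some third term.
shared_neighbors = [terms[t] for t in range(len(terms))
                    if t not in (i, j)
                    and first_order[i, t] > 0 and first_order[t, j] > 0]

# Term coordinates in the rank-k space (rows of U_k S_k) and their cosine.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_coords = U[:, :k] * s[:k]
cosine = term_coords[i] @ term_coords[j] / (
    np.linalg.norm(term_coords[i]) * np.linalg.norm(term_coords[j]))
```

In this toy collection "fire" and "truck" never share a document and are not synonyms, yet they end up with a positive similarity in the reduced space because both co-occur with the terms in `shared_neighbors`.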

When LSI is applied to a term-document matrix representing a collection of documents in the zillions, the co-occurrence phenomenon that affects the LSI scores becomes a global effect, occurring between documents in the collection.

Thus, the only way that end users (e.g. SEOs) could influence the LSI scores is if they can access and control the content of all the documents of the matrix, or launch a coordinated spam attack on the entire collection. The latter would be the case of a spammer trying to make an LSI-based search engine index billions of documents (to name a quantity) that he or she has created.

If an end user or researcher wants to understand and manipulate the effect of co-occurrence in a single document, he or she would need to deconstruct that document, make a term-passage matrix for it and apply LSI to this matrix, then experiment by manipulating single terms. Whatever the results, these will only be valid for the universe represented by the matrix, that is, for that and only that document.

If such a document is then submitted to the LSI-based search engine, that local effect simply vanishes and global co-occurrence "takes control" and spreads throughout the collection, forming the corresponding connectivity paths that eventually force a redistribution of term weights.

Consequently, SEOs that sell this idea of making documents "LSI-friendly", like some firms sending emails reading "is your site LSI optimized?" or "we can make your documents LSI-valid!", or those that promote the notion of "LSI and Link Popularity", end up exposed for what they are and for how much they know about search engines. The sad thing is that these find their way via search engine conferences (SES), blogs and forums to deceive the industry with such blogonomies. BTW here are Two More LSI Blogonomies.

Dr. E. Garcia

This post has been edited by orion: Dec 5 2006, 12:03 PM