nlp, similarity, cosine-similarity, lingpipe, latent-semantic-analysis

Using Latent Semantic Analysis to measure passage similarity


I'm currently developing a program to compare two pieces of text based on their semantics (meaning). I understand there are libraries such as LingPipe that provide useful methods for computing string distances; however, I've heard that LSA is the best method for measuring text similarity.

There is just one thing that confuses me about using LSA to measure text similarity. As I understand it, the process with LSA is:

1. The two passages are represented as two matrices, X and Y.

2. Using SVD, each matrix is reduced to three different matrices.

3. The cosine distance is then measured between the two matrices.

4. The cosine distance determines how similar they are.

I just want to know...

A. In SVD, the matrix is reduced to three smaller matrices. Which of these smaller matrices is used in the cosine distance measurement?

B. Cosine distance is usually applied to vectors. When applying it to matrices, I assume the matrices are iterated through and the cosine distance is measured between each pair of vectors, and that the average of all these distances is then taken as the final cosine distance between the two matrices. Is that right?

I understand this is a very niche topic, but I'm hoping for some light on these two questions. Thanks.


Solution

  • I think you started off on the wrong foot.

    The whole collection of passages is represented as a single type x document matrix, not as one matrix per passage. That is, rows represent the 'words' (distinct terms) of the collection; columns represent the passages of the collection.

    (Here you might want to apply the TF-IDF weighting scheme to the matrix.)

    Using SVD you can decompose such a matrix (M) into three matrices (U, S, and V) so that

    M = U * S * Vt

    S is a diagonal matrix holding the singular values of M sorted in decreasing order, and Vt is the transpose of V. You can perform dimension reduction by keeping the first k singular values and setting the others to 0.

    Now you can regenerate a rank-k approximation of the type x document matrix using the previous equation (with the truncated S) and compute cosine similarity between its row vectors (i.e. type similarity) or its column vectors (i.e. passage similarity).
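
The steps above can be sketched in Python with NumPy. This is a minimal illustration, not a reference implementation. The toy passages, the whitespace tokenization, the choice of k = 2 and the particular TF-IDF variant (log(N/df) + 1) are all assumptions made for the example, as are the names `passages`, `M_k`, `docs_k` and `cosine`.

```python
import numpy as np

# Hypothetical toy corpus, for illustration only.
passages = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "the dog sat on the log",
    "stock prices fell on the market",
]

# Build the type x document count matrix:
# rows are distinct words, columns are passages.
vocab = sorted({w for p in passages for w in p.split()})
row = {w: i for i, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(passages)))
for j, p in enumerate(passages):
    for w in p.split():
        M[row[w], j] += 1.0

# Optional TF-IDF weighting (one common variant, chosen here as an assumption).
df = np.count_nonzero(M, axis=1)          # number of passages containing each word
idf = np.log(len(passages) / df) + 1.0
M = M * idf[:, np.newaxis]

# Decompose M = U * S * Vt; s holds the singular values in decreasing order.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Dimension reduction: keep the first k singular values, set the others to 0.
k = 2
s_k = s.copy()
s_k[k:] = 0.0

# Regenerate the rank-k type x document matrix.
M_k = (U * s_k) @ Vt


def cosine(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Passage similarity: compare column vectors of the regenerated matrix.
sim_01 = cosine(M_k[:, 0], M_k[:, 1])
sim_03 = cosine(M_k[:, 0], M_k[:, 3])

# Equivalent shortcut: compare the k-dimensional passage vectors S_k * Vt_k.
docs_k = s[:k, np.newaxis] * Vt[:k, :]
sim_01_reduced = cosine(docs_k[:, 0], docs_k[:, 1])

print("passage 0 vs 1:", sim_01)
print("passage 0 vs 3:", sim_03)
print("passage 0 vs 1 (reduced space):", sim_01_reduced)
```

Because the columns of U are orthonormal, the cosine between two columns of the regenerated matrix equals the cosine between the corresponding k-dimensional columns of S_k * Vt_k, so in practice the full matrix does not have to be rebuilt for passage similarity. Note also that the result is a single similarity value per pair of passages, not an average over many vector pairs.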