Search code examples
pythonnumpynlpscipylatent-semantic-indexing

Latent Semantic Analysis in Python discrepancy


I'm trying to follow the Wikipedia Article on latent semantic indexing in Python using the following code:

documentTermMatrix = array([[ 0.,  1.,  0.,  1.,  1.,  0.,  1.],
                            [ 0.,  1.,  1.,  0.,  0.,  0.,  0.],
                            [ 0.,  0.,  0.,  0.,  0.,  1.,  1.],
                            [ 0.,  0.,  0.,  1.,  0.,  0.,  0.],
                            [ 0.,  1.,  1.,  0.,  0.,  0.,  0.],
                            [ 1.,  0.,  0.,  1.,  0.,  0.,  0.],
                            [ 0.,  0.,  0.,  0.,  1.,  1.,  0.],
                            [ 0.,  0.,  1.,  1.,  0.,  0.,  0.],
                            [ 1.,  0.,  0.,  1.,  0.,  0.,  0.]])
u,s,vt = linalg.svd(documentTermMatrix, full_matrices=False)

sigma = diag(s)
## remove extra dimensions...
numberOfDimensions = 4
for i in range(4, len(sigma) -1):
    sigma[i][i] = 0
queryVector = array([[ 0.], # same as first column in documentTermMatrix
                     [ 0.],
                     [ 0.],
                     [ 0.],
                     [ 0.],
                     [ 1.],
                     [ 0.],
                     [ 0.],
                     [ 1.]])

How the math says it should work:

dtMatrixToQueryAgainst = dot(u, dot(s,vt))
queryVector = dot(inv(s), dot(transpose(u), queryVector))
similarityToFirst = cosineDistance(queryVector, dtMatrixToQueryAgainst[:,0]
# gives 'matrices are not aligned' error.  should be 1 because they're the same

What does work, with math that looks incorrect: ( from here)

dtMatrixToQueryAgainst = dot(s, vt)
queryVector  = dot(transpose(u), queryVector)
similarityToFirst = cosineDistance(queryVector, dtMatrixToQueryAgainsst[:,0]) 
# gives 1, which is correct

Why does route work, and the first not, when everything I can find about the math of LSA shows the first as correct? I feel like I'm missing something obvious...


Solution

  • There are several inconsistencies in your code that cause errors before your point of confusion. This makes it difficult to understand exactly what you tried and why you are confused (clearly you did not run the code as it is pasted, or it would have thrown an exception earlier).

    That said, if I follow your intent correctly, your first approach is nearly correct. Consider the following code:

    documentTermMatrix = array([[ 0.,  1.,  0.,  1.,  1.,  0.,  1.],
                                [ 0.,  1.,  1.,  0.,  0.,  0.,  0.],
                                [ 0.,  0.,  0.,  0.,  0.,  1.,  1.],
                                [ 0.,  0.,  0.,  1.,  0.,  0.,  0.],
                                [ 0.,  1.,  1.,  0.,  0.,  0.,  0.],
                                [ 1.,  0.,  0.,  1.,  0.,  0.,  0.],
                                [ 0.,  0.,  0.,  0.,  1.,  1.,  0.],
                                [ 0.,  0.,  1.,  1.,  0.,  0.,  0.],
                                [ 1.,  0.,  0.,  1.,  0.,  0.,  0.]])
    numDimensions = 4
    u, s, vt = linalg.svd(documentTermMatrix, full_matrices=False)
    u = u[:, :numDimensions]
    sigma = diag(s)[:numDimensions, :numDimensions]
    vt = vt[:numDimensions, :]
    lowRankDocumentTermMatrix = dot(u, dot(sigma, vt))
    queryVector = documentTermMatrix[:, 0]
    lowDimensionalQuery = dot(inv(sigma), dot(u.T, queryVector))
    lowDimensionalQuery
    vt[:,0]
    

    You should see that lowDimensionalQuery and vt[:,0] are equal. Think of vt as a representation of the documents in a low-dimensional subspace. First we map our query into that subspace to get lowDimensionalQuery, and then we compare it with the corresponding column of vt. Your mistake was trying to compare the transformed query to the document vector from lowRankDocumentTermMatrix, which lives in the original space. Since the transformed query has fewer elements than the "reconstructed" document, Python complained.