Tags: python, machine-learning, nltk, information-retrieval, tf-idf

Python: tf-idf-cosine: to find document similarity


I was following a tutorial that was available as Part 1 & Part 2. Unfortunately, the author didn't have time for the final section, which was to use cosine similarity to find the distance between two documents. I followed the examples in the article with the help of the following link from Stack Overflow. The code from that link is included below for convenience.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA

train_set = ["The sky is blue.", "The sun is bright."]  # Documents
test_set = ["The sun in the sky is bright."]  # Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words=stopWords)
# print(vectorizer)
transformer = TfidfTransformer()
# print(transformer)

# Term counts; the vocabulary is learned from the train set only
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print('Fit Vectorizer to train set', trainVectorizerArray)
print('Transform Vectorizer to test set', testVectorizerArray)

# tf-idf weights for the train set
transformer.fit(trainVectorizerArray)
print()
print(transformer.transform(trainVectorizerArray).toarray())

# tf-idf weights for the test set (this refits the IDF on the test set)
transformer.fit(testVectorizerArray)
print()
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())

Running the above code gives me the following output:

Fit Vectorizer to train set [[1 0 1 0]
 [0 1 0 1]]
Transform Vectorizer to test set [[0 1 1 1]]

[[ 0.70710678  0.          0.70710678  0.        ]
 [ 0.          0.70710678  0.          0.70710678]]

[[ 0.          0.57735027  0.57735027  0.57735027]]

I am not sure how to use this output to calculate cosine similarity. I know how to implement cosine similarity for two vectors of the same length, but here I am not sure how to identify the two vectors.


Solution

  • With the help of @excray's comment, I managed to figure out the answer. What we need to do is write a simple for loop that iterates over the two arrays representing the train data and the test data.

    First implement a simple lambda function to hold the formula for the cosine calculation:

    cosine_function = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)
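
The same formula, cos(a, b) = (a · b) / (|a| |b|), can also be applied to all document vectors at once. The following is a minimal sketch using only NumPy; the function name `cosine_similarity_rows` is hypothetical, and the count vectors are copied by hand from the output shown in the question. Unlike the lambda, it returns 0 instead of dividing by zero when a vector has no terms from the vocabulary.

```python
import numpy as np


def cosine_similarity_rows(docs, query):
    """Cosine similarity between each row of `docs` and the 1-D `query`."""
    docs = np.asarray(docs, dtype=float)
    query = np.asarray(query, dtype=float)
    dots = docs @ query  # one dot product per document
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
    sims = np.zeros_like(dots)
    nonzero = norms > 0  # skip all-zero vectors to avoid dividing by zero
    sims[nonzero] = dots[nonzero] / norms[nonzero]
    return sims


# Count vectors copied from the question's output
train_counts = np.array([[1, 0, 1, 0],
                         [0, 1, 0, 1]])
query_counts = np.array([0, 1, 1, 1])

scores = cosine_similarity_rows(train_counts, query_counts)
print(np.round(scores, 3))
```

The first score is 1 / (√2 · √3) ≈ 0.408 and the second is 2 / (√2 · √3) ≈ 0.816.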
    

    Then write a simple for loop to iterate over the two arrays. The logic is: for each vector in trainVectorizerArray, find the cosine similarity with the vector in testVectorizerArray.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from nltk.corpus import stopwords
    import numpy as np
    import numpy.linalg as LA

    train_set = ["The sky is blue.", "The sun is bright."]  # Documents
    test_set = ["The sun in the sky is bright."]  # Query
    stopWords = stopwords.words('english')

    vectorizer = CountVectorizer(stop_words=stopWords)
    # print(vectorizer)
    transformer = TfidfTransformer()
    # print(transformer)

    trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
    testVectorizerArray = vectorizer.transform(test_set).toarray()
    print('Fit Vectorizer to train set', trainVectorizerArray)
    print('Transform Vectorizer to test set', testVectorizerArray)
    cosine_function = lambda a, b: round(np.inner(a, b) / (LA.norm(a) * LA.norm(b)), 3)

    # Compare every train (document) vector with every test (query) vector.
    # Note: these are the raw term-count vectors, not the tf-idf vectors.
    for vector in trainVectorizerArray:
        print(vector)
        for testV in testVectorizerArray:
            print(testV)
            cosine = cosine_function(vector, testV)
            print(cosine)

    transformer.fit(trainVectorizerArray)
    print()
    print(transformer.transform(trainVectorizerArray).toarray())

    transformer.fit(testVectorizerArray)
    print()
    tfidf = transformer.transform(testVectorizerArray)
    print(tfidf.todense())
    

    Here is the output:

    Fit Vectorizer to train set [[1 0 1 0]
     [0 1 0 1]]
    Transform Vectorizer to test set [[0 1 1 1]]
    [1 0 1 0]
    [0 1 1 1]
    0.408
    [0 1 0 1]
    [0 1 1 1]
    0.816
    
    [[ 0.70710678  0.          0.70710678  0.        ]
     [ 0.          0.70710678  0.          0.70710678]]
    
    [[ 0.          0.57735027  0.57735027  0.57735027]]
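
The loop above compares the raw count vectors. If the goal is to compare the tf-idf vectors instead, one possible sketch is shown below. It is not the code from the tutorial or from the answer: it swaps in scikit-learn's `TfidfVectorizer` (counting and tf-idf weighting in one step) and `sklearn.metrics.pairwise.cosine_similarity`, and it assumes scikit-learn's built-in English stop-word list as a stand-in for the NLTK list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_set = ["The sky is blue.", "The sun is bright."]  # Documents
test_set = ["The sun in the sky is bright."]  # Query

# Assumption: scikit-learn's built-in English stop-word list replaces the
# NLTK list; for these sentences both remove "the", "is" and "in".
vectorizer = TfidfVectorizer(stop_words="english")

# Learn the vocabulary and IDF weights from the documents only,
# then reuse them unchanged to weight the query.
train_tfidf = vectorizer.fit_transform(train_set)
test_tfidf = vectorizer.transform(test_set)

# sims[i, j] is the cosine similarity between query i and document j
sims = cosine_similarity(test_tfidf, train_tfidf)
for doc, score in zip(train_set, sims[0]):
    print(round(float(score), 3), doc)
```

For this toy corpus every term occurs in exactly one training document, so all IDF weights are equal and the scores come out the same as the count-based ones (about 0.408 and 0.816). On a larger corpus the tf-idf scores would generally differ from the count-based scores.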