Tags: python, pandas, scikit-learn, nlp, cosine-similarity

Didn't get the expected results when calculating cosine similarity between strings


I want to calculate the pairwise cosine similarity between two strings that are in the same row of a pandas data frame.

I used the following code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


pd.set_option('display.float_format', '{:.4f}'.format)


df = pd.DataFrame({'text1': ['The quick brown fox jumps over the lazy dog', 'The red apple', 'The big blue sky'],
                   'text2': ['The lazy cat jumps over the brown dog', 'The red apple', 'The big yellow sun']})


vectorizer = CountVectorizer().fit_transform(df['text1'] + ' ' + df['text2'])


cosine_similarities = cosine_similarity(vectorizer)[:, 0:1]


df['cosine_similarity'] = cosine_similarities


print(df)  

It gave me the following output, which seems incorrect:

[screenshot of the output; image not available]

Can anyone help me to figure out what I did incorrectly?

Thank you.


Solution

  • I'm no expert, but here's one way to do it.
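First, a note on what the code in the question appears to compute. `df['text1'] + ' ' + df['text2']` merges the two strings of each row into a single document, so `cosine_similarity` compares whole rows with each other, and the slice `[:, 0:1]` keeps only the comparison of every row with row 0. The sketch below reruns those steps with more descriptive names (`combined`, `row_vs_row`, and `first_column` are illustrative names, not from the question):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({'text1': ['The quick brown fox jumps over the lazy dog',
                             'The red apple',
                             'The big blue sky'],
                   'text2': ['The lazy cat jumps over the brown dog',
                             'The red apple',
                             'The big yellow sun']})

# Same steps as in the question: text1 and text2 of each row
# are concatenated into ONE document per row.
combined = df['text1'] + ' ' + df['text2']
X = CountVectorizer().fit_transform(combined)

# Entry [i, j] compares merged row i with merged row j,
# not text1 of row i with text2 of row i.
row_vs_row = cosine_similarity(X)

# The slice keeps column 0 only: every merged row compared with merged row 0.
first_column = row_vs_row[:, 0:1]

print(row_vs_row.shape)    # one row and one column per DataFrame row
print(first_column.ravel())
```

The first entry of that column is row 0 compared with itself, so it is 1.0 regardless of how similar `text1` and `text2` are. The code that follows keeps `text1` and `text2` as separate documents instead.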

    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    
    pd.set_option('display.float_format', '{:.4f}'.format)
    
    df = pd.DataFrame({'text1': ['The quick brown fox jumps over the lazy dog',
                                 'The red apple',
                                 'The big blue sky'],
                       'text2': ['The lazy cat jumps over the brown dog',
                                 'The red apple',
                                 'The big yellow sun']})
    
    vectorizer = CountVectorizer()
    
    # np.hstack([df["text1"], df["text2"]]) puts all "text2" after "text1"
    X = vectorizer.fit_transform(np.hstack([df["text1"], df["text2"]]))
    
    cs = cosine_similarity(X)  # full symmetric numpy.ndarray
    
    # The values you want are on an offset diagonal of cs since
    # "text2" strings were stacked at the end of "text1" strings
    
    pairwise_cs = cs.diagonal(offset=len(df))
    df["cosine_similarity"] = pairwise_cs
    
    print(df)
    

    which shows:

                                             text1                                  text2  cosine_similarity
    0  The quick brown fox jumps over the lazy dog  The lazy cat jumps over the brown dog             0.8581
    1                                The red apple                          The red apple             1.0000
    2                             The big blue sky                     The big yellow sun             0.5000
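The full matrix `cs` has (2n)² entries, although only n of them are used. For larger frames, a possible alternative is to compute just the row-wise values. This is a minimal sketch, not a tested replacement for the code above: it assumes one shared vocabulary fitted on both columns, uses `normalize` from `sklearn.preprocessing` to scale every row to unit length, and the names `A`, `B`, and `row_cs` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

df = pd.DataFrame({'text1': ['The quick brown fox jumps over the lazy dog',
                             'The red apple',
                             'The big blue sky'],
                   'text2': ['The lazy cat jumps over the brown dog',
                             'The red apple',
                             'The big yellow sun']})

# Fit one vocabulary on both columns so the two matrices share the same features.
vectorizer = CountVectorizer().fit(pd.concat([df['text1'], df['text2']]))

# L2-normalize each row; the cosine similarity of two unit vectors is their dot product.
A = normalize(vectorizer.transform(df['text1']))
B = normalize(vectorizer.transform(df['text2']))

# Element-wise product summed per row gives the dot product of row i of A with row i of B.
row_cs = np.asarray(A.multiply(B).sum(axis=1)).ravel()

df['cosine_similarity'] = row_cs
print(df)
```

For the three example rows this should agree with the offset-diagonal values shown above, since both approaches use the same count vectors.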