python-2.7 scikit-learn similarity

Jaccard similarity in Python


I am trying to find the Jaccard similarity between two documents. However, I am having a hard time understanding how the function sklearn.metrics.jaccard_similarity_score() works behind the scenes. As I understand it, Jaccard similarity = (number of terms in the intersection of the two docs) / (number of terms in the union of the two docs).

Consider the example below. My DTM (document-term matrix) for the two documents is:

array([[1, 1, 1, 1, 2, 0, 1, 0],
       [2, 1, 1, 0, 1, 1, 0, 1]], dtype=int64)

The function above gives me this Jaccard similarity score:

print(sklearn.metrics.jaccard_similarity_score(tf_matrix[0,:],tf_matrix[1,:]))
0.25

I tried to compute the score on my own as follows:

terms present in both docs (intersection) = 4
distinct terms in doc 1 = 6
distinct terms in doc 2 = 6
Jaccard = 4 / (6 + 6 - 4) = 0.5
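
The manual calculation above can be reproduced with a minimal sketch in plain Python. The helper name `term_set_jaccard` is hypothetical (it is not part of scikit-learn), and treating any nonzero count as "term present" is an assumption taken from the way the terms are counted above.

```python
# Term-count rows copied from the document-term matrix in the question.
doc1 = [1, 1, 1, 1, 2, 0, 1, 0]
doc2 = [2, 1, 1, 0, 1, 1, 0, 1]


def term_set_jaccard(a, b):
    """Hypothetical helper: Jaccard similarity of the sets of term
    indices that have a nonzero count in each row."""
    terms_a = {i for i, count in enumerate(a) if count > 0}
    terms_b = {i for i, count in enumerate(b) if count > 0}
    union = terms_a | terms_b
    if not union:
        # Assumption: two documents with no terms are treated as identical.
        return 1.0
    # float() keeps the division correct on Python 2.7 as well.
    return float(len(terms_a & terms_b)) / len(union)


manual_score = term_set_jaccard(doc1, doc2)
print(manual_score)  # 0.5
```

Here the union has 8 term indices and the intersection has 4, which matches 4 / (6 + 6 - 4).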

Can someone please help me understand whether there is something obvious I am missing here?


Solution

  • As stated in the scikit-learn documentation:

    In binary and multiclass classification, the Jaccard similarity coefficient score is equal to the classification accuracy.

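    Therefore, in your example the function treats each row as a vector of class labels and calculates the proportion of positions where the two rows hold exactly the same value. Only 2 of the 8 positions match (the second and third), so you get 2/8 = 0.25 as the result.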
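
The accuracy-style behaviour described in this answer can be illustrated with a short sketch in plain Python. It does not call scikit-learn; `match_fraction` is a hypothetical helper that only mimics the "proportion of matching elements" reading of the score.

```python
# Term-count rows copied from the document-term matrix in the question.
doc1 = [1, 1, 1, 1, 2, 0, 1, 0]
doc2 = [2, 1, 1, 0, 1, 1, 0, 1]


def match_fraction(a, b):
    """Hypothetical helper: fraction of positions where both rows hold
    exactly the same value, which is how classification accuracy is
    computed when the rows are read as label vectors."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    # float() keeps the division correct on Python 2.7 as well.
    return float(matches) / len(a)


accuracy_like = match_fraction(doc1, doc2)
print(accuracy_like)  # 0.25
```

Counts are compared for exact equality here, so a term that appears once in one document and twice in the other counts as a mismatch. That is why this figure differs from the set-based 0.5. In later scikit-learn releases `jaccard_similarity_score` was replaced by `jaccard_score`; binarizing the rows (count > 0) before scoring is one way to get the set-based value, though that is a suggestion rather than something the answer itself states.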