I want to use the Jaccard index to measure the similarity between two sets.
I found a Jaccard index implementation here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_similarity_score.html
but that library function expects lists as input, while in my case I would prefer to use sets.
I wrote this code:
from sklearn.metrics import jaccard_similarity_score


def jaccard_index(first_set, second_set):
    """ Computes jaccard index of two sets
    Returns:
        index(float): Jaccard index between two sets; it is
            between 0.0 and 1.0
    """
    # If both sets are empty, jaccard index is defined to be 1
    index = 1.0
    if first_set or second_set:
        index = (float(len(first_set.intersection(second_set)))
                 / len(first_set.union(second_set)))
    return index


y_pred = [0, 2, 1, 3, 5]
y_true = [0, 1, 2, 3, 7]
# The same values as sets, for the set-based implementation
a = {0, 2, 1, 3, 5}
b = {0, 1, 2, 3, 7}
print(jaccard_similarity_score(y_true, y_pred))
print(jaccard_similarity_score(y_true, y_pred, normalize=False))
print(jaccard_index(a, b))
These are the outputs of the three print statements:
Why are they different from my implementation (0.666666666667)? Why is the second result 2? Shouldn't the Jaccard index be between 0 and 1? Which is the better implementation, and which one should I use?
From the documentation:
If normalize == True, return the average Jaccard similarity coefficient,
else it returns the sum of the Jaccard similarity coefficient over the sample set.
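The two modes can be illustrated without scikit-learn. This is only a minimal sketch, assuming that for single-label inputs like the lists in the question each sample scores 1 when the labels at the same position are equal and 0 otherwise; the helper name per_sample_scores is hypothetical and is not part of sklearn.

```python
def per_sample_scores(y_true, y_pred):
    """Hypothetical helper: 1.0 where the labels at the same position match, else 0.0."""
    return [1.0 if t == p else 0.0 for t, p in zip(y_true, y_pred)]


y_pred = [0, 2, 1, 3, 5]
y_true = [0, 1, 2, 3, 7]

scores = per_sample_scores(y_true, y_pred)
total = sum(scores)            # analogue of normalize=False: sum over samples
average = total / len(scores)  # analogue of normalize=True: mean over samples

print(total)
print(average)
```

Under that assumption the sum is a count of matching positions, which is why it is not confined to the range 0 to 1, while the average is.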
By the way, you can see the code of the sklearn implementation here.
I see the main problem now: it comes from the nature of sets. You have the line a={0,2,1,3,5}. A set is unordered, so after this a is simply the collection {0, 1, 2, 3, 5} (small integers usually print in sorted order), and the original positions of the elements are lost. The same happens to b, independently of a. As a result, your function measures the overlap between two collections of values, whereas jaccard_similarity_score compares the two lists element by element, position by position. So you can't use sets here, because the original position of each element matters.
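The difference can be seen side by side in plain Python. This is a rough sketch rather than the sklearn code: set_overlap and positional_agreement are hypothetical names, and the positional version assumes single-label inputs of equal length.

```python
def set_overlap(first, second):
    """Jaccard index of the values, ignoring order and position."""
    a, b = set(first), set(second)
    if not a and not b:
        return 1.0  # convention from the question: two empty sets are identical
    return float(len(a & b)) / len(a | b)


def positional_agreement(y_true, y_pred):
    """Fraction of positions where both lists hold the same label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return float(matches) / len(y_true)


y_pred = [0, 2, 1, 3, 5]
y_true = [0, 1, 2, 3, 7]

# Order is irrelevant for sets: both literals describe the same set.
same_set = {0, 2, 1, 3, 5} == {0, 1, 2, 3, 5}

set_result = set_overlap(y_true, y_pred)
positional_result = positional_agreement(y_true, y_pred)

print(same_set)
print(set_result)
print(positional_result)
```

The values 1 and 2 appear in both lists, so they count as shared in the set version, but they sit at swapped positions, so they count as mismatches in the positional version.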