I was trying to understand what is going on with sklearn's jaccard_score.
These are the results I got:
1. jaccard_score([0, 1, 1], [1, 1, 1])
0.6666666666666666
2. jaccard_score([1, 1, 0], [1, 0, 0])
0.5
3. jaccard_score([1, 1, 0], [1, 0, 1])
0.3333333333333333
I understand that the formula is
intersection / (size of A + size of B - intersection)
I thought the last one should give me 0.2, because the overlap is 1 and the total number of entries is 6, which would give 1/5. But I got 0.33333...
Can anyone explain how sklearn calculates jaccard_score?
Per sklearn's doc, the jaccard_score function "is used to compare set of predicted labels for a sample to the corresponding set of labels in y_true". If the attributes are binary, the computation is based on the confusion matrix: TP / (TP + FP + FN). Otherwise, the same computation is done using the confusion matrix of each attribute value / class label.
The above definition for binary attributes / classes can be reduced to the set definition as explained in the following.
Assume that there are three records r1, r2, and r3. The vectors [0, 1, 1] and [1, 1, 1] -- which are the true and predicted classes of the records -- can be mapped to the two sets {r2, r3} and {r1, r2, r3}, respectively. Here, each element of a vector indicates whether the corresponding record belongs to the set. The Jaccard similarity of the two sets is the same as the similarity value defined for the two vectors.