I was trying to understand what is going on with sklearn's jaccard_score.
These are the results I got:
1. jaccard_score([0, 1, 1], [1, 1, 1])
0.6666666666666666
2. jaccard_score([1, 1, 0], [1, 0, 0])
0.5
3. jaccard_score([1, 1, 0], [1, 0, 1])
0.3333333333333333
I understand that the formula is
intersection / (size of A + size of B - intersection)
I thought the last one should give me 0.2, because the overlap is 1 and the total number of entries is 6, which would give 1/5. But I got 0.33333...
Can anyone explain how sklearn calculates jaccard_score?
Per sklearn's doc, the jaccard_score function "is used to compare set of predicted labels for a sample to the corresponding set of labels in y_true". If the attributes are binary, the computation is based on the confusion matrix: TP / (TP + FP + FN). Otherwise, the same computation is done using the confusion matrix of each attribute value / class label.
The above definition for binary attributes / classes can be reduced to the set definition as explained in the following.
Assume that there are three records r1, r2, and r3. The vectors [0, 1, 1] and [1, 1, 1] -- which are the true and predicted classes of the records -- can be mapped to the two sets {r2, r3} and {r1, r2, r3}, respectively. Here, each element of a vector indicates whether the corresponding record belongs to the set. The Jaccard similarity of the two sets is the same as the similarity value defined for the two vectors.