I have the following matrix
a =
0 10 10 0 0
0 5 5 0 0
1 0 0 50 51
0 0 10 100 100
I compute the Jaccard distances
D = pdist(a,'jaccard');
D =
1.0000 1.0000 0.7500 1.0000 1.0000 1.0000
and finally I put the distances in a matrix
sim = squareform(D)
sim =
0 1.0000 1.0000 0.7500
1.0000 0 1.0000 1.0000
1.0000 1.0000 0 1.0000
0.7500 1.0000 1.0000 0
The documentation describes the 'jaccard' option as "One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ." (http://www.mathworks.it/help/stats/pdist.html)
The distance between rows 1 and 4 is correct (0.75), while the distance between rows 1 and 2 should be 0 but is instead 1. It seems that when the Jaccard similarity is 1, MATLAB doesn't perform the 1 - similarity computation. What am I doing wrong?
MATLAB seems right to me.
All of the non-zero numbers in rows 1 and 2 differ (in row 1 they're all 10, in row 2 they're all 5), so rows 1 and 2 should have a distance of 1.
Three out of the four coordinates where rows 1 and 4 are not both zero differ (10:0, 10:10, 0:100, 0:100), so rows 1 and 4 should have a distance of 0.75.
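To check the counting above by hand, here is a minimal sketch that applies the documented definition directly, without pdist. The anonymous function jaccardDist is a hypothetical helper, not part of MATLAB, and it assumes at least one of the two rows has a non-zero entry (otherwise it divides zero by zero and returns NaN).

```matlab
a = [0 10 10   0   0;
     0  5  5   0   0;
     1  0  0  50  51;
     0  0 10 100 100];

% Hypothetical helper: among the coordinates where at least one of the
% two rows is non-zero, return the fraction in which the rows differ.
jaccardDist = @(x, y) nnz((x ~= y) & ((x ~= 0) | (y ~= 0))) / nnz((x ~= 0) | (y ~= 0));

d12 = jaccardDist(a(1,:), a(2,:))   % 2 of 2 coordinates differ -> 1
d14 = jaccardDist(a(1,:), a(4,:))   % 3 of 4 coordinates differ -> 0.75
```

Note that the comparison is on the values themselves, so 10 versus 5 counts as a difference even though both entries are non-zero.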
There seems to be a lot of disagreement about what exactly is meant by the Jaccard "coefficient", the Jaccard "index", the Jaccard "similarity" and the Jaccard "distance", and which is one minus which. MATLAB's documentation doesn't help: in the sentence you quote, it's not obvious whether "which" refers to (what MATLAB is describing as) the Jaccard coefficient, or to one minus the Jaccard coefficient.
In any case, whether or not the terminology used by the MATLAB documentation is correct, the function pdist seems to be giving consistent results, and you can always take one minus whatever it outputs if you want something different.
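If what you actually want is the set-based Jaccard distance, where only the positions of the non-zero entries matter and not their values, one option is to binarize the matrix before calling pdist. This is only a sketch under that assumption about your intent; b, Db and S are illustrative names, and pdist and squareform require the Statistics Toolbox.

```matlab
a = [0 10 10   0   0;
     0  5  5   0   0;
     1  0  0  50  51;
     0  0 10 100 100];

b  = double(a ~= 0);        % 1 where a is non-zero, 0 elsewhere
Db = pdist(b, 'jaccard');   % distances based on the non-zero patterns only
S  = 1 - Db;                % the corresponding similarities, if you prefer those

squareform(Db)              % same layout as your sim matrix
```

After binarizing, rows 1 and 2 have the same non-zero pattern, so their distance comes out as 0, which is the value you were expecting.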