
Interpretation of cosine similarity and jaccard similarity (similarity of histograms)


Introduction

I would like to assess the similarity between two arrays of bin counts (from two histograms) using the MATLAB pdist2 function:

% Input
bin_counts_a = [689   430   311   135    66    67    99    23    37    19     8     4     3     4     1     3     1     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     1     0     0     0     0     1];
bin_counts_b = [569   402   200   166   262    90    50    16    33    12     6    35    49     4    12     8     8     2     1     0     0     0     0     1     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0     1];

% Visualize the two "bin counts" vectors as bars:
bar(1:length(bin_counts_a),[bin_counts_a;bin_counts_b])

% Calculation of similarities
cosine_similarity  = 1 - pdist2(bin_counts_a,bin_counts_b,'cosine')
jaccard_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'jaccard')

% Output
cosine_similarity =

          0.95473215802008


jaccard_similarity =

        0.0769230769230769

Question

If the cosine similarity is close to 1, which means the two vectors are similar, shouldn't the Jaccard similarity be close to 1 as well?


Solution

  • According to the documentation, the 'jaccard' distance is the "percentage of nonzero coordinates that differ". It counts how many coordinates differ, not by how much they differ.

    For instance, assume bin_counts_a as in your example and

    bin_counts_b = bin_counts_a + 1;
    

    Then

    >> cosine_similarity  = 1 - pdist2(bin_counts_a,bin_counts_b,'cosine')
    cosine_similarity =
       0.999971577948095
    

    is almost 1 as expected, because the bin counts are very similar. However,

    >> jaccard_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'jaccard')
    jaccard_similarity =
         0
    

    gives 0 because each entry in bin_counts_b is (slightly) different from that in bin_counts_a.

    For assessing the similarity between the histograms, 'cosine' is probably a more meaningful option than 'jaccard', because it takes the magnitudes of the bin counts into account. You may also want to consider the Kullback-Leibler divergence, although it is not symmetric in the two distributions and is not one of the built-in pdist2 distance metrics.
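
To make the difference between the two measures concrete, here is a minimal sketch that recomputes both similarities by hand, without pdist2. The vectors x and y and the helper names cos_sim and jac_sim are hypothetical and chosen only for illustration. The Jaccard helper follows the definition quoted from the documentation: among the coordinates where at least one vector is nonzero, it counts those whose values are exactly equal.

```matlab
% Hypothetical toy vectors (not the data from the question)
x = [4 2 0 1 0];
y = [4 3 0 1 2];

% Cosine similarity: uses the actual magnitudes of the entries
cos_sim = @(u, v) dot(u, v) / (norm(u) * norm(v));

% Jaccard similarity in the sense of the 'jaccard' distance:
% fraction of coordinates that are exactly equal, counted only over
% coordinates where at least one of the two vectors is nonzero.
% (Returns NaN if both vectors are all zeros.)
jac_sim = @(u, v) sum((u == v) & ((u ~= 0) | (v ~= 0))) / sum((u ~= 0) | (v ~= 0));

fprintf('cosine(x, y)    = %.4f\n', cos_sim(x, y));      % 0.9163
fprintf('jaccard(x, y)   = %.4f\n', jac_sim(x, y));      % 0.5000

% Shifting every entry by 1 barely changes the direction of the vector,
% but leaves no coordinate exactly equal.
fprintf('cosine(x, x+1)  = %.4f\n', cos_sim(x, x + 1));  % 0.9661
fprintf('jaccard(x, x+1) = %.4f\n', jac_sim(x, x + 1));  % 0.0000
```

Applied to the bin counts in the question, the same counting appears to explain the reported value: only 2 of the 26 coordinates where either count is nonzero are exactly equal, and 2/26 ≈ 0.0769.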
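
Since pdist2 has no built-in Kullback-Leibler option, the divergence can be computed directly from the bin counts. The following is only a sketch under stated assumptions: the counts are hypothetical toy data, and the constant smoothing is an arbitrary choice added here to avoid log(0) and division by zero in empty bins. The result depends noticeably on that constant, so it should be chosen with care for real data.

```matlab
% Hypothetical toy bin counts (not the data from the question)
p_counts = [4 2 0 1 0];
q_counts = [4 3 0 1 2];

% Assumption: add a small constant to every bin so that empty bins
% do not produce log(0) or division by zero.
smoothing = 1e-6;

% Normalize the smoothed counts to probability distributions
p = (p_counts + smoothing) / sum(p_counts + smoothing);
q = (q_counts + smoothing) / sum(q_counts + smoothing);

% Kullback-Leibler divergence in both directions (natural logarithm)
kl_pq = sum(p .* log(p ./ q));   % D(P || Q)
kl_qp = sum(q .* log(q ./ p));   % D(Q || P)

fprintf('D(P||Q) = %g\n', kl_pq);
fprintf('D(Q||P) = %g\n', kl_qp);
```

The two printed values differ, which illustrates the asymmetry mentioned above: D(P||Q) and D(Q||P) are generally not equal.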