matlab histogram similarity cosine-similarity pdist

Interpretation of cosine similarity and Jaccard similarity (similarity of histograms)

Introduction

I would like to assess the similarity between two arrays of bin counts (taken from two histograms) using the MATLAB function pdist2:

% Input
bin_counts_a = [689   430   311   135    66    67    99    23    37    19     8     4     3     4     1     3     1     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     1     0     0     0     0     1];
bin_counts_b = [569   402   200   166   262    90    50    16    33    12     6    35    49     4    12     8     8     2     1     0     0     0     0     1     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0     1];

% Visualize the two "bin counts" vectors as bars:
bar(1:length(bin_counts_a),[bin_counts_a;bin_counts_b])

% Calculation of similarities
cosine_similarity  = 1 - pdist2(bin_counts_a,bin_counts_b,'cosine')
jaccard_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'jaccard')

% Output
cosine_similarity =

          0.95473215802008


jaccard_similarity =

        0.0769230769230769

Question

If the cosine similarity is close to 1, which means the two vectors are similar, shouldn't the Jaccard similarity be close to 1 as well?

Solution

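The 'jaccard' distance, according to the documentation, is the "percentage of nonzero coordinates that differ": it only registers whether two coordinates differ, not by how much they differ. For the vectors in the question, 26 coordinates are nonzero in at least one of the two vectors, and only 2 of those (positions 14 and 50) hold exactly the same count in both, which is consistent with the reported similarity of 2/26 ≈ 0.0769.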
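As a rough illustration of that definition, the sketch below computes the Jaccard distance by hand for two short made-up vectors (x and y are hypothetical, not the data from the question). It assumes the distance is the number of differing coordinates among those where at least one vector is nonzero, divided by the number of such coordinates, which is how I read the documentation; it uses base MATLAB only.

```matlab
% Made-up example vectors (not the data from the question)
x = [3 0 5 0 2];
y = [3 0 4 1 2];

% Coordinates where at least one of the two vectors is nonzero
nonzero_in_either = (x ~= 0) | (y ~= 0);

% Coordinates where the two vectors differ, by any amount
differ = (x ~= y);

% Assumed definition: fraction of the "nonzero" coordinates that differ
jaccard_distance   = sum(differ & nonzero_in_either) / sum(nonzero_in_either);
jaccard_similarity = 1 - jaccard_distance
```

Here four coordinates are nonzero in at least one vector and two of them differ, so the similarity comes out as 0.5, regardless of whether the differing entries are off by 1 or by 1000.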

For instance, assume bin_counts_a as in your example and

bin_counts_b = bin_counts_a + 1;

Then

>> cosine_similarity  = 1 - pdist2(bin_counts_a,bin_counts_b,'cosine')
cosine_similarity =
   0.999971577948095

is almost 1 as expected, because the bin counts are very similar. However,

>> jaccard_similarity = 1 - pdist2(bin_counts_a,bin_counts_b,'jaccard')
jaccard_similarity =
     0

gives 0 because every entry in bin_counts_b differs (slightly) from the corresponding entry in bin_counts_a, so no coordinate counts as a match.
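The same contrast can be sketched without pdist2, using the textbook formula for cosine similarity (dot product divided by the product of the norms). The vector x is a made-up stand-in for a bin-counts array, so the numbers will not match the ones above.

```matlab
% Made-up stand-in for a bin-counts vector, and a copy shifted by 1
x = [3 0 5 0 2];
y = x + 1;

% Cosine similarity: depends on the angle between x and y,
% so a small uniform shift barely changes it
cosine_similarity = dot(x, y) / (norm(x) * norm(y))

% Fraction of coordinates that are exactly equal: none,
% which is why an exact-match measure reports no similarity at all
fraction_equal = mean(x == y)
```

The cosine value stays close to 1 while the fraction of exactly matching entries is 0, which mirrors the behaviour of the two pdist2 results.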

For assessing the similarity between two histograms, 'cosine' is probably a more meaningful option than 'jaccard'. You may also want to consider the Kullback-Leibler divergence, although it is not symmetric in the two distributions and is not one of the distances that pdist2 computes.
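A minimal sketch of the Kullback-Leibler divergence in base MATLAB, under some assumptions of mine: counts_p and counts_q are made-up bin counts, and smoothing is a small constant I add to every bin so that empty bins do not produce log(0) or division by zero. The choice of that constant affects the result noticeably when one histogram has empty bins where the other does not, so treat the value below as a placeholder.

```matlab
% Made-up bin counts (not the data from the question)
counts_p = [10 5 3 0 2];
counts_q = [ 8 6 2 1 3];

% Assumed smoothing constant to avoid log(0) and division by zero
smoothing = 1e-10;

% Normalize the smoothed counts into probability distributions
p = (counts_p + smoothing) / sum(counts_p + smoothing);
q = (counts_q + smoothing) / sum(counts_q + smoothing);

% KL divergence in both directions (natural logarithm, so units are nats)
kl_pq = sum(p .* log(p ./ q))
kl_qp = sum(q .* log(q ./ p))

% One common way to obtain a symmetric quantity: average both directions
kl_symmetric = 0.5 * (kl_pq + kl_qp)
```

The two directions generally give different values, which is the asymmetry mentioned above; averaging them is just one possible workaround.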