
Measuring similarity between binary lists


I have two binary lists that I am trying to compare. To compare them, I sum the positions where the corresponding values are equal and convert the result to a percentage:

import numpy as np

l1 = [1,0,1]
l2 = [1,1,1]

print(np.dot(l1 , l2) / len(l1) * 100)

prints 66.666

So in this case l1 and l2 are 66.666% close. As the lists become less similar, the closeness value decreases.

For example, using the values:

l1 = [1,0,1]
l2 = [0,1,0]

returns 0.0

How can I plot l1 and l2 in a way that describes the relationship between them? Is there a name for this method of measuring similarity between binary values?

Using a scatter plot:

import matplotlib.pyplot as plt
import pandas as pd

plt.scatter('x', 'y', data=pd.DataFrame({'x': l1, 'y': l2}))

produces:

[image: scatter plot of l1 against l2]

But this plot does not seem to make sense.

Update:

"if both entries are 0, this will not contribute to your 'similarity'"

The updated code below computes a similarity measure that also counts positions where both values are 0 towards the final score.

import numpy as np

l1 = [0,0,0]
l2 = [0,1,0]

print(np.isclose(l1, l2).mean() * 100)

which returns:

66.66666666666666
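
To make the difference between the two measures concrete, here is a minimal sketch that runs both on the lists used above. The function names both_ones_pct and matching_pct are made up for illustration. The dot-product version only counts positions where both lists hold 1, while the equality version also counts 0-0 matches. The equality version is one minus the normalized Hamming distance, and is often called the simple matching coefficient.

```python
import numpy as np


def both_ones_pct(a, b):
    """Percentage of positions where both lists hold 1 (what np.dot counts)."""
    return np.dot(a, b) / len(a) * 100


def matching_pct(a, b):
    """Percentage of positions where the lists agree, 0-0 matches included."""
    return np.mean(np.equal(a, b)) * 100


pairs = [
    ([1, 0, 1], [1, 1, 1]),
    ([0, 0, 0], [0, 1, 0]),
    ([1, 0, 1], [0, 1, 0]),
]

for a, b in pairs:
    # The two measures agree on the first and last pair but not on the middle one.
    print(a, b, round(both_ones_pct(a, b), 2), round(matching_pct(a, b), 2))
```

The two measures only diverge when the lists share 0 values, which is exactly the case raised in the quoted comment.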

Alternatively, the code below uses normalized_mutual_info_score, which returns 1.0 both for identical lists and for lists that differ in every position. Does that mean normalized_mutual_info_score is not a suitable similarity measure?

from sklearn.metrics.cluster import normalized_mutual_info_score

l1 = [1,0,1]
l2 = [0,1,0]

print(normalized_mutual_info_score(l1 , l2))

l1 = [0,0,0]
l2 = [0,0,0]

print(normalized_mutual_info_score(l1 , l2))

prints:

1.0
1.0
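
A likely explanation for the 1.0 on the flipped lists: normalized_mutual_info_score is a clustering metric, so it compares how the positions are grouped and ignores which label each group carries. The sketch below does not call scikit-learn. It uses a hypothetical helper, partition, to show that [1,0,1] and [0,1,0] induce the same grouping of positions even though no position matches.

```python
def partition(labels):
    """Group positions by label and forget the label values themselves."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, set()).add(index)
    return {frozenset(positions) for positions in groups.values()}


l1 = [1, 0, 1]
l2 = [0, 1, 0]

# Both lists split the positions into the groups {0, 2} and {1}.
same_grouping = partition(l1) == partition(l2)

# Position-by-position agreement, as in the update above.
matching_pct = sum(a == b for a, b in zip(l1, l2)) / len(l1) * 100

print(same_grouping, matching_pct)
```

So a score that depends only on the grouping cannot tell a binary list apart from its flipped copy, which makes it a poor fit when 0 and 1 have fixed meanings.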

Solution

  • No, the plot does not make sense. What you are doing is essentially an inner product between vectors. Under this metric, l1 and l2 are treated as vectors in a 3D space (in this case), and the result reflects whether they point in the same or a similar direction and have similar length. The output is a scalar value, so there is nothing to plot.

    If you want to show the individual contribution of each component, you could do something like

    contributions = [a==b for a, b in zip(l1, l2)]
    plt.plot(list(range(len(contributions))), contributions)
    

    but I'm still not sure that this makes sense.
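
To illustrate the geometric reading of the inner product mentioned in this answer, here is a small sketch using the first pair of lists from the question. The variable names are illustrative. It splits the dot product into the product of the vector lengths and the cosine of the angle between them, which shows that both direction and length feed into the single scalar.

```python
import numpy as np

l1 = np.array([1, 0, 1])
l2 = np.array([1, 1, 1])

# Inner product: the number of positions where both vectors hold 1.
dot = np.dot(l1, l2)

# Product of the Euclidean lengths of the two vectors.
norm_product = np.linalg.norm(l1) * np.linalg.norm(l2)

# Cosine of the angle between the vectors: direction only, length removed.
cosine = dot / norm_product

print(dot, cosine)
```

If only the direction matters, the cosine on its own may be a more natural quantity to report than the raw dot product divided by the list length.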