How to compute similarities between arrays?

I am trying to compute the similarity between two samples. The Python functions sklearn.metrics.pairwise.cosine_similarity and scipy.spatial.distance.cosine return results that are not what I expected. For example:

In the following example I would have expected a similarity of 0 (0%), because the two arrays have no elements in common.

 tt1 = [1, 16, 4, 21]
 tt2 = [5, 17, 3, 22]

 from scipy import spatial
 res = 1-spatial.distance.cosine(tt1, tt2)
 print(res)
 0.9893593529663931

Here I would have expected a similarity of 0.25 (25%), because only one element, the first one (1), is the same in both arrays.

 tt1 = [1, 16, 4, 21]
 tt2 = [1, 17, 3, 22]

 from scipy import spatial
 res = 1-spatial.distance.cosine(tt1, tt2)
 print(res)
 0.9990578001169402

Similarly, here I would have expected 0.5 (50%), since two elements are identical (1 and 16):
```
 tt1 = [1, 16, 4, 21]
 tt2 = [1, 16, 3, 22]
 res = 0.9989359418266097
```

Here I expected 0.75 (75%), since three elements are identical (1, 16 and 4):

 tt1 = [1, 16, 4, 21]
 tt2 = [1, 16, 4, 22]
 res = 0.9997474232272052

Is there a way in Python to get these expected results?

Solution

I think you are misunderstanding what the function computes. From your description, you want the fraction of matching elements (an accuracy, or one minus the misclassification error). However, the function takes two vectors u and v and computes the cosine distance between them. In your first example:

tt1 = [1, 16, 4, 21]
tt2 = [5, 17, 3, 22]

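Here u=tt1 and v=tt2. The values in each array are the coordinates of a single sample in a vector space (here a 4-dimensional space), not separate samples. Cosine similarity measures the angle between the two vectors, and since all of your example vectors point in nearly the same direction, the result is always close to 1. Refer to the function documentation, specifically the examples at the bottom.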
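To make this concrete, here is a minimal sketch that recomputes the cosine similarity by hand with NumPy. The helper name `cosine_similarity` is only illustrative (it is not the scikit-learn function of the same name), and it assumes neither vector is all zeros.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (assumes both are non-zero)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # dot product divided by the product of the Euclidean norms
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

tt1 = [1, 16, 4, 21]
tt2 = [5, 17, 3, 22]

# Matches 1 - spatial.distance.cosine(tt1, tt2) up to rounding
print(cosine_similarity(tt1, tt2))
```

The value depends only on the direction of the two vectors, so it says nothing about how many positions hold equal values.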

If each coordinate in these arrays represents a different sample, then:

If order matters (consider working with NumPy arrays from the start):
```
import numpy as np
np.mean(np.array(tt1) == np.array(tt2))
```

If order does not matter:

 len(np.intersect1d(np.array(tt1), np.array(tt2))) / len(tt1)
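As a rough check, the two variants can be wrapped in small helpers and run against the four examples from the question. The function names `positional_match` and `unordered_overlap` are made up for this sketch, and it assumes both arrays have the same length.

```python
import numpy as np

def positional_match(a, b):
    """Fraction of positions where a and b hold the same value."""
    a, b = np.asarray(a), np.asarray(b)
    if a.shape != b.shape:
        raise ValueError("arrays must have the same shape")
    return float(np.mean(a == b))

def unordered_overlap(a, b):
    """Fraction of a's length covered by values that also occur in b."""
    # np.intersect1d returns sorted unique common values,
    # so repeated values are counted only once
    return len(np.intersect1d(a, b)) / len(a)

tt1 = [1, 16, 4, 21]
candidates = [
    [5, 17, 3, 22],
    [1, 17, 3, 22],
    [1, 16, 3, 22],
    [1, 16, 4, 22],
]

scores = [positional_match(tt1, c) for c in candidates]
print(scores)  # [0.0, 0.25, 0.5, 0.75]
```

The two helpers differ once the order changes: for `[21, 4, 16, 1]`, `positional_match` gives 0.0 while `unordered_overlap` gives 1.0.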