python similarity hierarchical-clustering scipy-spatial

How to compute similarities between arrays?


I am trying to compute the similarity between two samples. The Python functions sklearn.metrics.pairwise.cosine_similarity and scipy.spatial.distance.cosine return results that do not match what I expect. For example:

  • In the following I would have expected 0.0%, because the two arrays have no identical samples.

     tt1 = [1, 16, 4, 21]
     tt2 = [5, 17, 3, 22]
    
     from scipy import spatial
     res = 1-spatial.distance.cosine(tt1, tt2)
     print(res)
     0.9893593529663931
    
  • I would have expected a similarity of 0.25 (25%), because only a single sample, the first one (1), is the same in both arrays.

     tt1 = [1, 16, 4, 21]
     tt2 = [1, 17, 3, 22]
    
     from scipy import spatial
     res = 1-spatial.distance.cosine(tt1, tt2)
     print(res)
     0.9990578001169402
    
  • In the same way, I would have expected 0.5 (50%) in the following: two identical samples (1 and 16).

     tt1 = [1, 16, 4, 21]
     tt2 = [1, 16, 3, 22]
     res = 0.9989359418266097
    
  • Here 0.75 (75%) was expected: three identical samples (1, 16 and 4).

     tt1 = [1, 16, 4, 21]
     tt2 = [1, 16, 4, 22]
     res = 0.9997474232272052
    

Is there a way in Python to achieve those expected results?


Solution

  • I think you are misunderstanding what the function computes. From your description, you want to compute the misclassification error / accuracy, that is, the fraction of entries on which the two arrays agree. The function, however, receives two samples u, v and computes the cosine distance between them. In your first example:

    tt1 = [1, 16, 4, 21]
    tt2 = [5, 17, 3, 22]
    

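    we have u=tt1 and v=tt2. The values in each array are the coordinates of a single sample in the vector space (here a 4-dimensional space), not separate samples. Cosine similarity depends only on the angle between the two vectors, so two vectors pointing in nearly the same direction score close to 1 even when none of their coordinates are equal. Refer to the function documentation, and specifically to the examples at the bottom.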
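
To make the point about coordinates concrete, here is a minimal sketch that computes the cosine similarity by hand for the first example, using only the standard library. The helper name cosine_similarity is illustrative and not part of SciPy; it assumes both inputs are non-zero vectors of equal length.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (hypothetical helper, not a SciPy function)."""
    dot = sum(a * b for a, b in zip(u, v))       # u . v
    norm_u = math.sqrt(sum(a * a for a in u))    # ||u||
    norm_v = math.sqrt(sum(b * b for b in v))    # ||v||
    return dot / (norm_u * norm_v)

tt1 = [1, 16, 4, 21]
tt2 = [5, 17, 3, 22]

# No coordinate is equal, yet the vectors point in almost the same direction.
sim = cosine_similarity(tt1, tt2)
print(sim)  # about 0.989, the value reported in the question
```

The result is close to 1 because the measure compares directions, not individual entries, which is why it cannot yield the 0 expected in the question.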

    If each coordinate in these arrays represents a different sample, then:

    • If order matters: (consider working with NumPy arrays to begin with)

      import numpy as np
      np.mean(np.array(tt1) == np.array(tt2))
      
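    • If order does not matter (np.intersect1d returns the unique values common to both arrays, so repeated values are counted once):

       len(np.intersect1d(np.array(tt1), np.array(tt2))) / len(tt1)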
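
The two one-liners above can be wrapped into small helpers and run against the four examples from the question. This is only a sketch: the function names positional_match_fraction and shared_value_fraction are made up for illustration, and both assume the arrays have the same length and contain no repeated values.

```python
import numpy as np

def positional_match_fraction(a, b):
    """Fraction of positions where a and b hold the same value (order matters)."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(a == b))

def shared_value_fraction(a, b):
    """Fraction of a's length covered by values that also occur in b (order ignored)."""
    common = np.intersect1d(a, b)  # sorted unique values present in both arrays
    return len(common) / len(a)

tt1 = [1, 16, 4, 21]
candidates = [
    [5, 17, 3, 22],   # no match        -> 0.0
    [1, 17, 3, 22],   # one match       -> 0.25
    [1, 16, 3, 22],   # two matches     -> 0.5
    [1, 16, 4, 22],   # three matches   -> 0.75
]

for tt2 in candidates:
    print(tt2, positional_match_fraction(tt1, tt2), shared_value_fraction(tt1, tt2))

# The two measures differ once the order changes:
reversed_tt1 = tt1[::-1]
print(positional_match_fraction(tt1, reversed_tt1))  # 0.0, no position agrees
print(shared_value_fraction(tt1, reversed_tt1))      # 1.0, every value is shared
```

For the question's data both helpers give 0.0, 0.25, 0.5 and 0.75; which one to use depends on whether the position of each sample is meaningful.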