Cosine similarity TSNE in sklearn.manifold


I'm having trouble running TSNE on my dataset using cosine similarity.

I have calculated the cosine similarity between all of my vectors, so I have a square matrix containing the cosine similarities:

A = [[  1    0.7   0.5   0.6  ]
     [  0.7   1    0.3   0.4  ]
     [  0.5  0.3    1    0.1  ]
     [  0.6  0.4   0.1    1   ]]

Then I use TSNE like this:

A = np.matrix([[1, 0.7,0.5,0.6],[0.7,1,0.3,0.4],[0.5,0.3,1,0.1],[0.6,0.4,0.1,1]])
model = manifold.TSNE(metric="precomputed")
Y = model.fit_transform(A) 

But I'm not sure that using the precomputed metric preserves the meaning of my cosine similarity.

From the documentation:
If metric is “precomputed”, X is assumed to be a distance matrix

But when I try to use the cosine metric instead, I get an error:

A = np.matrix([[1, 0.7,0.5,0.6],[0.7,1,0.3,0.4],[0.5,0.3,1,0.1],[0.6,0.4,0.1,1]])
model = manifold.TSNE(metric="cosine")
Y = model.fit_transform(A) 

raise ValueError("All distances should be positive, either "
ValueError: All distances should be positive, either the metric or precomputed distances given as X are not correct

So my question is: how can I run TSNE with the cosine metric on an existing dataset (a similarity matrix)?


Solution

  • I can answer most of your question, although I'm not sure why your second example raises that error.

    You have calculated the cosine similarity between each pair of your vectors, but scikit-learn assumes a distance matrix as the input to TSNE when metric="precomputed". Converting between the two is simple: distance = 1 - similarity. So for your example:

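    import numpy as np
    from sklearn import manifold

    # Cosine similarity matrix from the question
    A = np.array([[1, 0.7, 0.5, 0.6], [0.7, 1, 0.3, 0.4], [0.5, 0.3, 1, 0.1], [0.6, 0.4, 0.1, 1]])
    D = 1.0 - A  # convert similarity to distance
    # Recent scikit-learn versions need init="random" with a precomputed metric
    # and a perplexity smaller than the number of samples (4 here)
    model = manifold.TSNE(metric="precomputed", init="random", perplexity=2)
    Y = model.fit_transform(D)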
    

    This should give you the transformation you want.
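
If the original feature vectors are still available, the same idea can be sketched end to end. The snippet below is only an illustration under assumptions: the data `X` is random and hypothetical, `similarity_to_distance` is a helper name made up for this example, and the perplexity value is arbitrary. It clips tiny negative values that rounding can introduce, which may be related to the "All distances should be positive" complaint, though that is a guess.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity


def similarity_to_distance(sim):
    """Convert a cosine similarity matrix into a cosine distance matrix."""
    dist = 1.0 - np.asarray(sim, dtype=float)
    # Rounding error can leave tiny negative values; clip them to zero.
    np.clip(dist, 0.0, None, out=dist)
    # Every point is at distance zero from itself.
    np.fill_diagonal(dist, 0.0)
    return dist


# Hypothetical feature vectors, one row per sample (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))

D = similarity_to_distance(cosine_similarity(X))

# Option 1: pass the precomputed distance matrix.
# init="random" is used because PCA initialisation needs feature vectors.
Y = TSNE(metric="precomputed", init="random", perplexity=5,
         random_state=0).fit_transform(D)

# Option 2: pass the raw vectors and let TSNE compute cosine distances.
Y_raw = TSNE(metric="cosine", init="random", perplexity=5,
             random_state=0).fit_transform(X)

print(Y.shape, Y_raw.shape)
```

Note that metric="cosine" treats each row of the input as a feature vector, so it is meant for the raw data rather than for a matrix of similarities that has already been computed.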