python scikit-learn cosine-similarity

Computing cosine similarity between corresponding row vectors, not all pairwise distances with cdist


I have two pandas DataFrames, trigger and action, whose rows are 25-dimensional feature vectors, and I want the cosine similarity between corresponding rows. The code below instead produces the full 20675 x 20675 matrix of pairwise cosine distances (1 minus the cosine similarity):

trigger.shape
(20675, 25)
action.shape
(20675, 25)
from scipy.spatial.distance import cdist
result = cdist(trigger, action, metric='cosine')
result.shape
(20675, 20675)

I would like to end up with a result matrix that has shape 20675 x 1 where each row is the cosine similarity between the corresponding row vectors from trigger and action.

I've searched and can't find a way to do this.


Solution

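  • You could compute the cosine similarity yourself. Here a and b hold the rows of trigger and action as NumPy arrays (for example trigger.to_numpy() and action.to_numpy()).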

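    import numpy as np
    from scipy import linalg
    # The leading "1 -" gives the cosine distance (as cdist does); drop it for the similarity.
    cosineSim1 = 1 - np.sum(a * b, axis=-1) / (linalg.norm(a, axis=-1) * linalg.norm(b, axis=-1))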
    

    Test whether you get correct values:

    from scipy import spatial
    cosineSim2 = []
    for row_a, row_b in zip(a,b):
        cosineSim2.append(spatial.distance.cosine(row_a, row_b))
    np.allclose(cosineSim1, cosineSim2)  # Should output True
    

    Timing tests (total seconds for 100 runs; func1 and func2 wrap the two computations above):

    timeit.timeit(func1, number=100)   # computes cosineSim1
    0.006364107131958008
    
    timeit.timeit(func2, number=100)  # computes cosineSim2
    0.34532594680786133
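
For reference, here is a minimal self-contained sketch of the vectorized approach that returns the similarity itself (not the distance) as an n x 1 column, which is what the question asks for. The function name rowwise_cosine_similarity and the small random arrays standing in for trigger and action are assumptions made for illustration; they are not part of the original answer. Since np.asarray also accepts DataFrames, the real frames could probably be passed in directly.

```python
import numpy as np


def rowwise_cosine_similarity(left, right):
    """Cosine similarity between row i of `left` and row i of `right`.

    Hypothetical helper for illustration; both inputs must have the same shape.
    """
    a = np.asarray(left, dtype=float)
    b = np.asarray(right, dtype=float)
    # One dot product per row pair, without building the n x n matrix.
    dots = np.einsum("ij,ij->i", a, b)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Shape (n,); an all-zero row yields NaN because its norm is 0.
    return dots / norms


# Small random stand-ins for the real trigger/action data (assumed shapes).
rng = np.random.default_rng(0)
trigger = rng.normal(size=(5, 25))
action = rng.normal(size=(5, 25))

similarity = rowwise_cosine_similarity(trigger, action)
result = similarity.reshape(-1, 1)  # column vector, here of shape (5, 1)
```

With the full data this would give a 20675 x 1 array. Memory use stays proportional to the input size, whereas the cdist call first materializes all 20675 x 20675 pairs.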