python numpy scikit-learn vectorization cosine-similarity

Compute all cosine similarities in a matrix


Say I have a matrix mat, a 100 x 200 array.

My question is twofold:

  1. How can I compute the cosine similarity of the first row against all the other rows? I tried using sklearn's cosine_similarity function but passing in a 100 x 200 matrix gives me a 100 x 100 array (instead of a 100 x 1 array).

  2. If I wanted to compute the cosine similarities of all the rows against each other, i.e. all 100 C 2 = 4950 distinct pairs of rows, would it be fastest to skip something like sklearn, store the norm of each row using np.linalg.norm, and then compute each similarity as cos_sim = dot(a, b)/(norm(a)*norm(b))?


Solution

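  • 1- Try the following, which compares the first row against every row (itself included) and returns an array of 100 values: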

    import numpy

    cosines = numpy.inner(mat[0], mat) / (numpy.linalg.norm(mat[0]) * numpy.linalg.norm(mat, axis=1))
    

    2- You can adapt the previous code to do something similar for all rows, knowing that

    numpy.linalg.norm(mat, axis=1)
    

    computes the norms of all the rows in a single call, so at each step you only need to divide by the product of the current row's norm and the others. Also,

    numpy.inner(mat, mat)
    

    will give you the symmetric 100 x 100 matrix of inner products between every pair of rows.
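Putting the two pieces together, here is a minimal NumPy-only sketch of both parts of the question. The random `mat` is just a hypothetical stand-in for the real 100 x 200 array, and the code assumes no row is all zeros (otherwise a norm is 0 and the division breaks down). The names `norms`, `first_vs_all`, `sim` and `pair_sims` are mine, not from the original answer.

```python
import numpy as np

# Hypothetical stand-in for the 100 x 200 matrix `mat` from the question.
rng = np.random.default_rng(0)
mat = rng.standard_normal((100, 200))

# Row norms, computed once and reused (assumes no all-zero rows).
norms = np.linalg.norm(mat, axis=1)

# Part 1: first row against every row, itself included -> shape (100,).
first_vs_all = (mat @ mat[0]) / (norms[0] * norms)

# Part 2: all pairwise similarities -> symmetric matrix of shape (100, 100).
sim = (mat @ mat.T) / np.outer(norms, norms)

# The 100 C 2 = 4950 distinct pairs are the entries strictly above the diagonal.
i, j = np.triu_indices(len(mat), k=1)
pair_sims = sim[i, j]
```

The full matrix computes each pair twice plus the diagonal, but it does so in one vectorized matrix product, which is likely to be much faster than looping over the 4950 pairs in Python. If you would rather stay with sklearn for part 1, passing the first row as a 2-D slice, `cosine_similarity(mat[0:1], mat)`, should return a 1 x 100 array instead of the 100 x 100 one.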