Tags: python, cosine-similarity, pairwise

How can I calculate pairwise cosine similarity across multiple vectors in Python?


To keep things simple, say I have four vectors (W, X, Y, Z), each containing the same number of values. I'm trying to calculate the cosine similarity between them pairwise in Python, but I can't seem to get the right answer.

If I try comparing W vs. X:

print(np.dot(W, X.T)/(np.linalg.norm(W)*np.linalg.norm(X)))

I get the following result:

[[0.9984622004973391]]

If I compare W vs. Y I get:

[[0.8891911653057049]]

And if I compare W to Z I get:

[[0.9676746591879851]]

Of course, I don't want to do these manually one by one, since in reality I have many vectors.

When I try to calculate all three (X, Y, Z) vs. W at once:

V = pd.concat([X, Y, Z])
print(np.dot(W, V.T)/(np.linalg.norm(W)*np.linalg.norm(V)))

I get the following:

[[0.9982175434442747 0.005561082504669956 0.020547860729214433]]

...where the first value nearly matches what I got when running them individually (but not quite), while the others are way off.

There must be a problem with my all-at-once approach, but I haven't been able to figure out how to fix it. Any ideas? Thanks!


Solution

  • When you execute np.dot(W, V.T), you get three values, like

    [[3.9353 2.4442 2.418 ]]
    

    Each of these values needs its own normalization (one for each of X, Y, Z), but np.linalg.norm(V) returns just one value: the norm of the whole matrix V. To calculate the norm of each vector separately (one per row of V), you must add the parameter axis=1.

    Finally, the correct and short code looks like this:

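    import numpy as np

    V = np.concatenate([X, Y, Z])  # stack X, Y, Z as the rows of V
    # axis=1 gives one norm per row of V, so each dot product is scaled by its own vector's norm
    cos_sim = (W @ V.T)/(np.linalg.norm(W)*np.linalg.norm(V, axis=1))
    print(cos_sim)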
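
To check the axis=1 fix, and to get every pair at once rather than only W against the others, something like the following sketch may help. It does not use the asker's data: the vectors are random placeholders, assumed to be NumPy arrays of shape (1, n), and the names cos_sim_one, batched, and pairwise are made up for illustration.

```python
import numpy as np

# Hypothetical data: four row vectors of equal length, each shaped (1, n).
rng = np.random.default_rng(0)
W, X, Y, Z = (rng.random((1, 5)) for _ in range(4))

def cos_sim_one(a, b):
    """Cosine similarity of two (1, n) row vectors, computed one pair at a time."""
    return (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))

V = np.concatenate([X, Y, Z])            # shape (3, n), one vector per row

# Without axis, norm(V) is a single scalar for the whole matrix (Frobenius norm).
whole_matrix_norm = np.linalg.norm(V)
# With axis=1, there is one norm per row.
row_norms = np.linalg.norm(V, axis=1)    # shape (3,)

# W against X, Y, Z in one step, each dot product divided by its own row norm.
batched = (W @ V.T) / (np.linalg.norm(W) * row_norms)   # shape (1, 3)
one_by_one = np.array([cos_sim_one(W, v) for v in (X, Y, Z)])

# All pairs at once: scale every row to unit length, then take dot products.
A = np.concatenate([W, X, Y, Z])         # shape (4, n)
A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
pairwise = A_unit @ A_unit.T             # shape (4, 4); entry [i, j] compares row i with row j

print(batched)
print(pairwise)
```

The first row of pairwise (after the diagonal entry) should agree with batched, and the diagonal should be 1 because each vector is compared with itself. If the inputs are pandas objects, as in the question, converting them with to_numpy() first is probably the safest way to reuse this.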