python · pandas · numpy · sklearn-pandas · cosine-similarity

Ranking similarity of one vector with a very large dataframe of vectors in pandas


Objective: I'm trying to create an ordered list of items that are ranked based on how close they are with a test item.

I have 1 test item with 10 attributes and 250,000 items with 10 attributes. I want a list that ranks the 250,000 items by how close they are to the test item. For example, if the resulting list came back [10,50,21,11,10000....] then the item with index 10 would be closest to my test item, index 50 would be second closest to my test item, etc.

What I have tried works for small dataframes but not larger dataframes:

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = np.random.rand(4,4)

#4 items with the first being the test
#0.727048   0.113704    0.886672    0.0345438
#0.496636   0.678949    0.0627973   0.547752
#0.641021   0.498811    0.628728    0.575058
#0.760778   0.955595    0.646792    0.126714 

#creates the cosine similarity matrix 
winner = cosine_similarity(similarity_matrix) 

#I just need the first row: how similar each item is to the test. I'm excluding how similar the test is to itself
winner = np.argsort(winner[0:1,1:])

#I want to reverse the order and add one so the indices match the original rows
winner = np.flip(winner) +1

Unfortunately, with 250,000 items I get the following error: "MemoryError: Unable to allocate 339. GiB for an array with shape (250000, 250000) and data type float64"

Instead of creating a 250,000 × 250,000 matrix, I really only need the first row. Is there another way of doing this?


Solution

  • If you call cosine_similarity with a second argument, it will only compute the similarity against that second array.
    Here is an example with random vectors:

    x = np.random.rand(5,2)
    

    With one argument, it computes the full pairwise similarity matrix:

    cosine_similarity(x)
    array([[1.        , 0.95278802, 0.93496787, 0.45860786, 0.62841819],
           [0.95278802, 1.        , 0.99853581, 0.70677904, 0.8349406 ],
           [0.93496787, 0.99853581, 1.        , 0.74401257, 0.86348853],
           [0.45860786, 0.70677904, 0.74401257, 1.        , 0.979448  ],
           [0.62841819, 0.8349406 , 0.86348853, 0.979448  , 1.        ]])
    

    With the first vector passed as the second argument, you get a single column: the similarity of each row to that vector:

    cosine_similarity(x, [x[0]])
    array([[1.        ],
           [0.95278802],
           [0.93496787],
           [0.45860786],
           [0.62841819]])
    

    If you're still running out of memory, you can compute the similarities in chunks:

    chunks = 4
    np.concatenate(
        [cosine_similarity(i, [x[0]]) for i in np.array_split(x, chunks)]
    )
    array([[1.        ],
           [0.95278802],
           [0.93496787],
           [0.45860786],
           [0.62841819]])
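
    To get the ranked list of item indices the question asks for, flatten that single column and argsort it in descending order. A minimal sketch with illustrative variable names, assuming (as in the question) that row 0 of the data is the test item and the remaining rows are the 250,000 candidates:

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    data = np.random.rand(250000, 10)  # 250,000 items with 10 attributes; row 0 is the test item
    test = data[0:1]                   # the test item as a 2-D (1, 10) array

    # one similarity value per row instead of a 250,000 x 250,000 matrix
    sims = cosine_similarity(data, test).ravel()

    # indices of the other items, most similar first
    # (skip index 0, the test item itself, and add 1 to recover the original row numbers)
    ranking = np.argsort(sims[1:])[::-1] + 1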