How calculating pairwise cosine similarities for the huge amount of vectors can be optimized(estimates suits)?
For two sets (A, B), containing vectors - pairwise cosine similarity sim(a_i, b_j) need to be generated for every a and b. (Cosine similarity matrix also suits since it's easy to transform from a matrix to pairwise.)
It looks like a common problem because of the need for calculation of such distances in computational biology, recommendation systems, etc. But I haven't managed to find some reasonable solution for this.
By definition complexity of this problem is O(len_A * len_B * O(similarity_function)) so 10^6 vectors in both A and B sets tend to huge running time
It looks like, we are doing a lot of useless work here since similarities aren't independent (if we have a similarity of a_i calculated for million vectors, and b_j very similar to a_i - and we have b_j similarity for 900k of vectors calculated we can estimate b_j similarity to rest 100k vectors). I assume that something like indexing may be used here.
Your entries are appreciated.
python==3.6
pandas==0.25.0
scikit-learn==0.21.3
numpy==1.17.1
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
df_1 = pd.DataFrame({'object_id_1': range(10),
'feature_0': np.random.uniform(0,1,10),
'feature_1': np.random.uniform(0,1,10),
'feature_2': np.random.uniform(0,1,10),
'feature_3':np.random.uniform(0,1,10)})
df_2 = pd.DataFrame({'object_id_2': range(10,20),
'feature_0': np.random.uniform(0,1,10),
'feature_1': np.random.uniform(0,1,10),
'feature_2': np.random.uniform(0,1,10),
'feature_3':np.random.uniform(0,1,10)})
def get_similarities(df_1: pd.DataFrame, df_2: pd.DataFrame, meaningful_features:list) -> pd.DataFrame:
'''
This function generates features based similarity scores, between two groups of objects
Parameters
----------
df_1: pandas.DataFrame
DataFrame with features, and id_s of objects
df_2: pandas.DataFrame
DataFrame with features, and id_s of objects which has no id_s same to df_1
meaningful_features: list
Features columns to calculate similarity on
Returns
----------
similarities_of_objects: pandas.DataFrame
DataFrame, with columns 'object_id_1', 'object_id_2', 'similarity',
where we have features similarity, for each object_1-object_2 pair.
Similarity - symmetric.
'''
objects_1 = [] # list of all objects from df_1
objects_2 = [] # list of all objects from df_2
similarities = [] # list of scores for object_1-object_2 pairs
for object_1 in df_1['object_id_1'].unique():
features_vector_1 = df_1[df_1['object_id_1'] == object_1][meaningful_features] # object_1 features vector
for object_2 in df_2['object_id_2'].unique():
features_vector_2 = df_2[df_2['object_id_2'] == object_2][meaningful_features] # object_2 features vector
objects_1.append(object_1)
objects_2.append(object_2)
similarities.append(cosine_similarity(X = np.array(features_vector_1)
,Y = np.array(features_vector_2)).item()) # similarities of vectors
sim_o1_to_o2 = pd.DataFrame()
sim_o1_to_o2['objects_1']= objects_1
sim_o1_to_o2['objects_2']= objects_2
sim_o1_to_o2['similarity']= similarities
return sim_o1_to_o2
get_similarities(df_1,df_2, ['feature_0', 'feature_1', 'feature_2'])
Use Faiss
import faiss
dimension = 100
value1 = np.random.random((n, dimension)).astype('float32')
index = faiss.IndexFlatL2(d)
index.add(value1)
xq = value2
k= len(value1)
D, I = index.search(xq, k)
Note that here D is the distance, and I is the Index of the values.
Also, value1 and value2 are nothing but NumPy arrays.
PS: Install faiss first.
pip install faiss