Search code examples
pythonpandasalgorithmnumpysimilarity

Pairwise similarity/similarity matrix calculation optimization


Problem definition


Question

How calculating pairwise cosine similarities for the huge amount of vectors can be optimized(estimates suits)?

Formal definition

For two sets (A, B), containing vectors - pairwise cosine similarity sim(a_i, b_j) need to be generated for every a and b. (Cosine similarity matrix also suits since it's easy to transform from a matrix to pairwise.)


Why am I asking for help


It looks like a common problem because of the need for calculation of such distances in computational biology, recommendation systems, etc. But I haven't managed to find some reasonable solution for this.

Problem I can't solve

By definition complexity of this problem is O(len_A * len_B * O(similarity_function)) so 10^6 vectors in both A and B sets tend to huge running time

My assumptions for future direction

It looks like, we are doing a lot of useless work here since similarities aren't independent (if we have a similarity of a_i calculated for million vectors, and b_j very similar to a_i - and we have b_j similarity for 900k of vectors calculated we can estimate b_j similarity to rest 100k vectors). I assume that something like indexing may be used here.



Additional details


  1. A and B aren't intersecting.
  2. Vectors dimensionality is already reduced to a minimal reasonable value.
  3. There's no need in simple for loop optimization. Briefly - here's short guide for optimization of this - simplest loops given for clear illustration of an algorithm.
  4. I'm interested if there's an algorithm which allows estimation as well, so it's ok if we have similarity close enough but not exactly the same to real.
  5. There's no need in parallelization.
  6. I understand that the resulting similarity matrix will be large in size.
  7. I'm also interested if that's an algorithm that allows getting only top similar vectors from set B for every vector from set A.

Your entries are appreciated.


Code samples


Requirements

python==3.6
pandas==0.25.0
scikit-learn==0.21.3
numpy==1.17.1

Generating dummy data

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity 

df_1 = pd.DataFrame({'object_id_1': range(10),
                   'feature_0': np.random.uniform(0,1,10),
                   'feature_1': np.random.uniform(0,1,10),
                   'feature_2': np.random.uniform(0,1,10),
                   'feature_3':np.random.uniform(0,1,10)})

df_2 = pd.DataFrame({'object_id_2': range(10,20),
                   'feature_0': np.random.uniform(0,1,10),
                   'feature_1': np.random.uniform(0,1,10),
                   'feature_2': np.random.uniform(0,1,10),
                   'feature_3':np.random.uniform(0,1,10)})

Function for similarities generation

def get_similarities(df_1: pd.DataFrame, df_2: pd.DataFrame, meaningful_features:list) -> pd.DataFrame:
    '''
    This function generates features based similarity scores, between two groups of objects
    
    Parameters
    ----------
    df_1: pandas.DataFrame
        DataFrame with features, and id_s of objects
    df_2: pandas.DataFrame
        DataFrame with features, and id_s of objects which has no id_s same to df_1
    meaningful_features: list
        Features columns to calculate similarity on
        
    Returns
    ----------
        similarities_of_objects: pandas.DataFrame
            DataFrame, with columns 'object_id_1', 'object_id_2', 'similarity', 
            where we have features similarity, for each object_1-object_2 pair. 
            Similarity - symmetric.  
    '''

    objects_1 = [] #  list of all objects from df_1
    objects_2 = [] #  list of all objects from df_2
    similarities = [] #  list of scores for object_1-object_2 pairs

    for object_1 in df_1['object_id_1'].unique():
        features_vector_1 = df_1[df_1['object_id_1'] == object_1][meaningful_features] # object_1 features vector
        
        for object_2 in df_2['object_id_2'].unique():
            features_vector_2 = df_2[df_2['object_id_2'] == object_2][meaningful_features] # object_2 features vector
            
            objects_1.append(object_1)
            objects_2.append(object_2)
            similarities.append(cosine_similarity(X = np.array(features_vector_1)
                                    ,Y = np.array(features_vector_2)).item()) # similarities of vectors 
    
    sim_o1_to_o2 = pd.DataFrame()

    sim_o1_to_o2['objects_1']= objects_1
    sim_o1_to_o2['objects_2']= objects_2
    sim_o1_to_o2['similarity']= similarities

    return sim_o1_to_o2

Generate similarities

get_similarities(df_1,df_2, ['feature_0', 'feature_1', 'feature_2'])

Solution

  • Use Faiss

    import faiss
    
    dimension = 100
    
    value1 = np.random.random((n, dimension)).astype('float32')
    index = faiss.IndexFlatL2(d)
    index.add(value1)
    
    xq = value2
    k= len(value1)
    D, I = index.search(xq, k) 
    

    Note that here D is the distance, and I is the Index of the values.

    Also, value1 and value2 are nothing but NumPy arrays.

    PS: Install faiss first.

    pip install faiss