Tags: python, pandas, numpy, scikit-learn, cosine-similarity

Replacing for-loops with better alternatives in pandas DataFrames for similarity measurement


I am writing a function that calculates the cosine similarity of each record in one dataset (MxK dimensions) against the records in another dataset (NxK dimensions), where N is much smaller than M.

The code below does the job well when I test it on a tiny dataset (the 'iris' dataset, for example), but I am worried it might struggle on bigger datasets (100K records and 100+ variables).

I know for-loops are not advisable in such scenarios, and I have two of them here. Can anyone suggest ways to improve this code?

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def similarity_calculation(seed_data, pool_data):
    # Collect one list of similarity scores per record in the pool data
    similarity_rows = []
    for indexi, rowi in pool_data.iterrows():
        # Fetch a single record from the pool dataset
        pool = rowi.values.reshape(1, -1)
        # Similarity scores of this pool record against each seed record
        similarity_score_array = []
        for indexj, rowj in seed_data.iterrows():
            # Fetch a single record from the seed dataset
            seed = rowj.values.reshape(1, -1)
            # Measure the similarity score between the two records
            similarity_score = cosine_similarity(pool, seed)[0][0]
            similarity_score_array.append(similarity_score)
        similarity_rows.append(similarity_score_array)
    # Build the similarity matrix once (DataFrame.append was removed in pandas 2.0)
    similarity_matrix = pd.DataFrame(similarity_rows)
    return similarity_matrix

Edit 1: The iris dataset is used as sample data, as follows:

iris_data = pd.read_csv("iris_data.csv", header=0)
# Split the data into seeds and pool sets, excluding the species details
seed_set = iris_data.iloc[:10, :4]
pool_set = iris_data.iloc[10:, :4]

Expected result: [image not included]

My new compact code (with a single for-loop) is as follows:

def similarity_calculation_compact(seed_data, pool_data):
    Array1 = pool_data.values
    Array2 = seed_data.values
    scores = []
    for i in range(Array1.shape[0]):
        scores.append(np.mean(cosine_similarity(Array1[None, i, :], Array2)))
    final_data = pool_data.copy()
    final_data['mean_similarity_score'] = scores
    final_data = final_data.sort_values(by='mean_similarity_score', ascending=False)
    return final_data

The output I am getting: [image not included]

I was expecting identical results, since both functions are supposed to fetch the records from the pool data that are most similar (in terms of average cosine similarity) to the seed data.


Solution

  • The for-loops are unnecessary: cosine_similarity takes two arrays of shapes (n_samples_X, n_features) and (n_samples_Y, n_features) and returns an array of shape (n_samples_X, n_samples_Y) containing the cosine similarity between every pair of rows from the two inputs.

    import numpy as np
    import pandas as pd
    from sklearn.metrics.pairwise import cosine_similarity
    
    iris_data = pd.read_csv("iris.csv", header=0)
    
    seed_set = iris_data.iloc[:10, :4]
    pool_set = iris_data.iloc[10:, :4]
    
    np.mean(cosine_similarity(pool_set, seed_set), axis=1)
    

    Result (after sorting):

    array([0.99952255, 0.99947777, 0.99947545, 0.99946886, 0.99946596, ...])
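
For completeness, here is one way the vectorized call might be wrapped so that it returns the same kind of sorted DataFrame as the compact function in the question. This is only a sketch: the function name rank_pool_by_mean_similarity is made up for illustration, and the data is random synthetic data standing in for the iris CSV, so the scores will not match the ones shown above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def rank_pool_by_mean_similarity(seed_data, pool_data):
    """Return pool_data sorted by mean cosine similarity to the seed records.

    Hypothetical helper name; not part of pandas or scikit-learn.
    """
    # Shape (n_pool, n_seed): one row per pool record, one column per seed record.
    scores = cosine_similarity(pool_data.to_numpy(), seed_data.to_numpy())
    ranked = pool_data.copy()
    # Average across the seed axis to get a single score per pool record.
    ranked["mean_similarity_score"] = scores.mean(axis=1)
    return ranked.sort_values(by="mean_similarity_score", ascending=False)


# Synthetic stand-in for the iris data (random values, for illustration only).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((50, 4)), columns=["a", "b", "c", "d"])
seed_set = data.iloc[:10]
pool_set = data.iloc[10:]

ranked = rank_pool_by_mean_similarity(seed_set, pool_set)
print(ranked.head())
```

Note that the intermediate score matrix has one entry per (pool record, seed record) pair, so with a very large pool it may be worth processing the pool in chunks of rows.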