I am writing a function that calculates the cosine similarity of each record in one dataset (M x K) against the records in another dataset (N x K), where N is much smaller than M.
The code below works well when I test it on a tiny dataset (the 'iris' dataset, for example), but I am worried it will struggle on bigger datasets (100K records and 100+ variables).
I know for loops are not advisable in such scenarios, and I have two of them here. Can anyone suggest ways of improving this code?
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def similarity_calculation(seed_data, pool_data):
    # Collect one list of similarity scores per pool record
    similarity_rows = []
    for indexi, rowi in pool_data.iterrows():
        # Create an array to store the similarity score for each record in pool data
        similarity_score_array = []
        for indexj, rowj in seed_data.iterrows():
            # Fetch a single record from pool dataset
            pool = rowi.values.reshape(1, -1)
            # Fetch a single record from seed dataset
            seed = rowj.values.reshape(1, -1)
            # Measure similarity score between the two records
            similarity_score = (cosine_similarity(pool, seed))[0][0]
            similarity_score_array.append(similarity_score)
        # Append the similarity score array as a new record to the similarity matrix
        similarity_rows.append(similarity_score_array)
    # Build the matrix once at the end (DataFrame.append was removed in pandas 2.0)
    similarity_matrix = pd.DataFrame(similarity_rows)
    return similarity_matrix
Edit 1: The iris dataset is used as sample data, as follows:
iris_data = pd.read_csv("iris_data.csv", header=0)
# Split the data into seeds and pool sets, excluding the species details
seed_set = iris_data.iloc[:10, :4]
pool_set = iris_data.iloc[10:, :4]
My new compact code (with a single for loop) is as follows:
def similarity_calculation_compact(seed_data, pool_data):
    Array1 = pool_data.values
    Array2 = seed_data.values
    scores = []
    for i in range(Array1.shape[0]):
        scores.append(np.mean(cosine_similarity(Array1[None, i, :], Array2)))
    final_data = pool_data.copy()
    final_data['mean_similarity_score'] = scores
    final_data = final_data.sort_values(by='mean_similarity_score', ascending=False)
    return final_data
I was expecting identical results, as both functions are supposed to fetch the records from the pool data that are most similar (in terms of average cosine similarity) to the seed data.
There is no need for the for loops, since cosine_similarity takes as input two arrays of shapes (n_samples_X, n_features) and (n_samples_Y, n_features) and returns an array of shape (n_samples_X, n_samples_Y) by computing the cosine similarity between each pair of rows from the two input arrays.
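To make the shapes concrete, here is a minimal sketch on random data. The array sizes and the names `pool`, `seed`, `scores` and `single` are made up for illustration and are not part of the question's data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the question's data: 6 pool rows, 3 seed rows, 4 features
rng = np.random.default_rng(0)
pool = rng.random((6, 4))
seed = rng.random((3, 4))

# A single call scores every pool row against every seed row
scores = cosine_similarity(pool, seed)
print(scores.shape)  # (6, 3)

# Entry [i, j] should match the row-by-row call used in the loop version
single = cosine_similarity(pool[[2]], seed[[1]])[0, 0]
print(np.isclose(scores[2, 1], single))
```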
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
iris_data = pd.read_csv("iris.csv", header=0)
seed_set = iris_data.iloc[:10, :4]
pool_set = iris_data.iloc[10:, :4]
np.mean(cosine_similarity(pool_set, seed_set), axis=1)
Result (after sorting):
array([0.99952255, 0.99947777, 0.99947545, 0.99946886, 0.99946596, ...])
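If you want the same output as similarity_calculation_compact (the pool records with a score column, sorted), the one-liner can be wrapped roughly like this. This is only a sketch: the function name `rank_pool_by_mean_similarity` is made up, and the random DataFrame stands in for the iris file so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def rank_pool_by_mean_similarity(seed_data, pool_data):
    # (M, N) matrix: one row per pool record, one column per seed record
    scores = cosine_similarity(pool_data.values, seed_data.values)
    ranked = pool_data.copy()
    # Average over the seed records to get one score per pool record
    ranked["mean_similarity_score"] = scores.mean(axis=1)
    return ranked.sort_values(by="mean_similarity_score", ascending=False)

# Hypothetical data in place of the iris CSV: 50 records, 4 numeric columns
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((50, 4)), columns=list("abcd"))
seed_set = data.iloc[:10]
pool_set = data.iloc[10:]

ranked = rank_pool_by_mean_similarity(seed_set, pool_set)
print(ranked.head())
```

The intermediate matrix has M x N entries, so with, say, 100,000 pool records and 10 seed records it holds 1,000,000 floats (about 8 MB as float64), which should fit comfortably in memory.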