
Efficiently calculate and store similarity matrix


For a recommender-system project in class, I am trying to build and store an item-based similarity matrix for a dataset with about 7000 users (rows) and 4000 movies (columns). What I have is a pivot table with UserIDs as the index, MovieIDs as the columns and the ratings as values. As you can imagine, there are a lot of 0-ratings.

Currently I am using the pearsonr function from the scipy package. I figured that in order to store all the distances, I have to calculate the Pearson coefficient between every pair of columns and store the results in a symmetric movie-movie matrix. My code so far (as you can see, I am new to Python/coding):

import pandas as pd
import numpy as np
from scipy.stats import pearsonr

data = pd.read_csv('data.csv')
data = data.pivot(index='UserID', columns='MovieID', values='Rating')

similarity_data = pd.DataFrame(index=data.columns, columns=data.columns)

# fill the movie-movie matrix one Pearson coefficient at a time
for i in range(len(data.columns)):
    for j in range(len(data.columns)):
        similarity_data.iloc[i, j] = pearsonr(data.iloc[:, i], data.iloc[:, j])[0]

Well, as you can imagine, this takes forever, and I'm eager to find out how to do it more efficiently. My first idea was to take advantage of the matrix being symmetric, but I couldn't figure out how.

My idea was something like:

for i in range(len(data.columns)):
    for j in range(i, len(data.columns)):
        similarity_data.iloc[i, j] = pearsonr(data.iloc[:, i], data.iloc[:, j])[0]
        similarity_data.iloc[j, i] = similarity_data.iloc[i, j]

However, even if I got this to work, I fear the real problem is the two for loops themselves. I tried to use a map or lambda approach instead, but couldn't get anywhere.

Any idea how to improve this (probably lots of them)?


Solution

  • You'll definitely want to use np.corrcoef, which will be about 1000 times faster than a naive loop over scipy.stats.pearsonr. For example:

    from scipy.stats import pearsonr
    import numpy as np
    import pandas as pd
    
    # make some small data
    df = pd.DataFrame(np.random.rand(100, 40))
    
    C1 = np.array([[pearsonr(df[i], df[j])[0] for i in df] for j in df])
    C2 = np.corrcoef(df.values.T)
    np.allclose(C1, C2)
    # True
    

    Here are the times:

    %timeit np.array([[pearsonr(df[i], df[j])[0] for i in df] for j in df])
    10 loops, best of 3: 154 ms per loop
    
    %timeit np.corrcoef(df.values.T)
    10000 loops, best of 3: 116 µs per loop
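
    Applied to the pivot table from the question, a minimal sketch might look like this. Note that np.corrcoef propagates NaN, so the gaps the pivot leaves for unrated movies must be filled first; filling them with 0 is an assumption here, not something the timings above depend on:

    # `data` is the UserID x MovieID pivot table from the question
    ratings = data.fillna(0).to_numpy()   # assumption: treat unrated entries as 0
    sim = np.corrcoef(ratings.T)          # movie-movie Pearson matrix
    similarity_data = pd.DataFrame(sim, index=data.columns, columns=data.columns)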
    

    Still, your result will be a dense matrix with about 16 million entries (4000 × 4000), so it's not going to be a fast computation. You might think about whether you really need to store all of those values, or whether you can use an algorithm that (for example) just computes the correlations of each movie's nearest neighbors; a sketch of that idea follows.
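
    One hedged sketch of that nearest-neighbors idea uses scikit-learn's NearestNeighbors with the correlation metric (correlation distance is 1 minus the Pearson coefficient). The k=10 choice and the fillna(0) step are illustrative assumptions, not part of the original answer:

    from sklearn.neighbors import NearestNeighbors

    # one row per movie, users as features (`data` is the pivot table again)
    X = data.fillna(0).to_numpy().T

    # scipy's 'correlation' distance is 1 - Pearson, so small distance = high similarity
    nn = NearestNeighbors(n_neighbors=10, metric='correlation')
    nn.fit(X)

    # note: each movie's nearest neighbor is itself, at distance 0
    distances, indices = nn.kneighbors(X)
    similarities = 1 - distances   # Pearson similarity to the 10 nearest movies

    This stores 10 values per movie instead of 4000, which may be all a recommender actually needs.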