python pandas numpy dataframe recommendation-engine

Predicting Values in Movie Recommendations

I've been trying to create a recommendation system using the MovieLens dataset in Python. My goal is to determine the similarity between users and then output the top five recommended movies for each user in this format:

User-id1 movie-id1 movie-id2 movie-id3 movie-id4 movie-id5
User-id2 movie-id1 movie-id2 movie-id3 movie-id4 movie-id5

The data I am using for now is this ratings dataset.

Here is the code so far:

import pandas as pd
import numpy as np
from sklearn import cross_validation as cv
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import mean_squared_error
from math import sqrt
import scipy.sparse as sp
from scipy.sparse.linalg import svds
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('ratings.csv')


df.drop('timestamp', axis=1, inplace=True)
n_users = df.userId.unique().shape[0]
n_items = df.movieId.unique().shape[0]

#Pivot table so users are rows and movies are columns, ratings are then values
df = df.pivot(index='userId', columns='movieId', values='rating')

#subtract row mean from each rating to center data
df = df.sub(df.mean(axis=1), axis=0)

#copy to fill in predictions
c1 = df.copy()
c1 = c1.fillna('a')

#second copy to find which values were filled in and return the highest rated values
c2 = c1.copy()

#fill NAN with 0
df = df.fillna(0)


#Get cosine similarity between rows
similarity = pd.DataFrame(cosine_similarity(df))

#get top 5 similar profiles
tmp = similarity.apply(lambda row: sorted(zip(similarity.columns, row), key=lambda c: -c[1]), axis=1)
tmp = tmp.ix[:,1:6]
l = np.array(tmp)

##Prediction function - does not work needs improvement
def predict(df, c1, l):
    for i in range(c1.shape[0]):
        for j in range(i+1, c1.shape[1]):
            try:
                if c1.iloc[i][j] == 'a':
                    num = df[l[i][0][0]]*l[i][0][1] + df[l[i][1][0]]*l[i][1][1] + df[l[i][2][0]]*l[i][2][1] + df[l[i][3][0]]*l[i][3][1] + df[l[i][4][0]]*l[i][4][1]
                    den = l[i][0][1] + l[i][1][0] + l[i][2][0] + l[i][3][0] + l[i][4][0]
                    c1[i][j] = num/den
            except:
                pass
    return c1

res = predict(df, c1, l)
print(res)


I am trying to implement the prediction function: I want to predict the missing values and add them to c1. The formula, as well as an example of how it should be used, is in the picture. As you can see, it uses the similarity scores of the most similar users.

The output of similarity looks like this. For example, here are user 1's similarities as (userId, score) pairs:

[(34, 0.19269904365720053) (196, 0.19187531680008307)
 (538, 0.14932027335788825) (67, 0.14093020024386654)
 (419, 0.11034407313683092) (319, 0.10055810007385564)]

I need help using these similarities in the prediction function to predict missing movie ratings. If that is solved I will then have to find the top 5 recommended movies for each user and output them in the format above.

I currently need help with the prediction function. Any advice helps. Please let me know if you need any more information or clarification.

Thank you for reading

Solution

First of all, vectorisation makes complex problems much easier. Here are a few suggestions to improve what you already have:

Use the userId as columns in the pivot table; this makes the example for the prediction easier to see.
NaN stands for a missing value, which is conceptually not the same as 0. In this particular case a large negative number will do, and it is only needed when calling the cosine similarity function.
Take advantage of pandas' advanced features, e.g. fillna can be used to keep the original values while adding the predictions.
When constructing the similarity dataframe, be sure to keep track of the userIds; you can do so by setting the index and columns to df.columns.
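
The points about NaN and about labelling the similarity matrix can be illustrated on a tiny made-up table. This is only a sketch: the toy ratings and the names `ratings`, `table` and `sim` are assumptions, and the cosine similarity is computed with plain NumPy on a 0-filled copy purely to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings.csv
ratings = pd.DataFrame({
    'userId':  [10, 10, 20, 20, 30, 30],
    'movieId': [1, 2, 1, 3, 2, 3],
    'rating':  [5.0, 3.0, 4.0, 2.0, 1.0, 5.0],
})

# Users as columns, movies as rows; unrated cells stay NaN
table = ratings.pivot(index='movieId', columns='userId', values='rating')

# mean() skips NaN, whereas a filled-in 0 is counted as a real rating
mean_with_nan = table[10].mean()             # average of 5.0 and 3.0
mean_with_zero = table[10].fillna(0).mean()  # average of 5.0, 3.0 and 0.0

# Cosine similarity between user columns (0 fill is a toy simplification)
m = table.fillna(0).to_numpy().T
norms = np.linalg.norm(m, axis=1)
sim = pd.DataFrame(m @ m.T / np.outer(norms, norms),
                   index=table.columns, columns=table.columns)

# Rows and columns are labelled by userId, so lookups use real IDs
print(sim.loc[10, 20])
```

Because `sim` carries the userIds on both axes, `sim[10].nlargest(3)` returns similarities that are already indexed by userId rather than by position.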

Here is my edited version of the code including predict implementation:

```

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import scale


def predict(l):
    # l is a Series of the top 5 similarities, indexed by userId
    # weighted sum of those users' ratings divided by the sum of similarities
    return (df[l.index] * l).sum(axis=1) / l.sum()


# use userId as columns for convenience when interpreting the formula
df = pd.read_csv('ratings.csv').pivot(columns='userId',
                                                index='movieId',
                                                values='rating')

similarity = pd.DataFrame(cosine_similarity(
    scale(df.T.fillna(-1000))),
    index=df.columns,
    columns=df.columns)
# iterate over each column (userId):
# for each userId find the five highest similarities
# (nlargest(6).iloc[1:] skips the user's similarity to themselves)
# and use them to calculate the predictions for that user;
# use fillna so that the original ratings don't change

res = df.apply(lambda col: ' '.join('{}'.format(mid) for mid in col.fillna(
    predict(similarity[col.name].nlargest(6).iloc[1:])).nlargest(5).index))
print(res)

```

Here is a sample of the output:

userId
1    1172 1953 2105 1339 1029
2           17 39 150 222 265
3      318 356 1197 2959 3949
4          34 112 141 260 296
5    597 1035 1380 2081 33166
dtype: object
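
To make it easier to see what predict computes, here is a small worked example on made-up numbers. The table, the user labels 'a', 'b', 'c', the similarity values and the helper name `predict_for` are all hypothetical; the arithmetic mirrors the weighted average used in predict above.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings table: movies as rows, users as columns, NaN = unrated
table = pd.DataFrame(
    {'a': [5.0, np.nan, 1.0],
     'b': [4.0, 2.0, np.nan],
     'c': [np.nan, 4.0, 5.0]},
    index=pd.Index([101, 102, 103], name='movieId'))

# Hypothetical similarities of user 'a' to its neighbours (self excluded)
neighbours = pd.Series({'b': 0.6, 'c': 0.2})


def predict_for(table, weights):
    # Weighted sum of the neighbours' ratings per movie; NaN ratings are
    # skipped by sum(), so they add nothing to the numerator
    weighted = (table[weights.index] * weights).sum(axis=1)
    # Same denominator as in predict: the sum of all neighbour weights
    return weighted / weights.sum()


pred = predict_for(table, neighbours)
# Keep a's own ratings and fill only the gaps with predictions
filled = table['a'].fillna(pred)
print(filled)
```

For movie 102 the prediction is (0.6 * 2 + 0.2 * 4) / (0.6 + 0.2) = 2.5, and user 'a' keeps the original ratings for movies 101 and 103. Note that the denominator includes the weights of neighbours who did not rate a movie, so movies rated by few neighbours get smaller predictions.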

Edit

The code above recommends the top 5 regardless of whether the user has already watched/rated them. To fix this, we can reset the values of the original ratings to 0 when choosing the recommendations, as shown below:

res = df.apply(lambda col: ' '.join('{}'.format(mid) for mid in (0 * col).fillna(
    predict(similarity[col.name].nlargest(6).iloc[1:])).nlargest(5).index))

The output

userId
1           2278 4085 3072 585 256
2               595 597 32 344 316
3              590 457 150 380 253
4         1375 2571 2011 1287 2455
5              480 590 457 296 165
6          1196 7064 26151 260 480
....
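
The (0 * col).fillna(...) trick works because 0 times a rating is 0 while 0 times NaN stays NaN, so only the unseen movies receive predicted scores. Here is a minimal sketch on made-up values; the Series `col` and `pred` are hypothetical, and the boolean-mask variant at the end is an alternative I am suggesting, not part of the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical column for one user: ratings for seen movies, NaN for unseen
col = pd.Series([4.0, np.nan, 5.0, np.nan], index=[1, 2, 3, 4], name='u')
# Hypothetical predicted scores for every movie
pred = pd.Series([3.5, 2.0, 4.5, 3.0], index=[1, 2, 3, 4])

# 0 * rating is 0 for seen movies, while 0 * NaN stays NaN
masked = 0 * col
# fillna only touches the NaN cells, i.e. the unseen movies
candidates = masked.fillna(pred)
top = candidates.nlargest(2).index.tolist()

# Alternative: drop seen movies outright instead of scoring them as 0
top_alt = pred[col.isna()].nlargest(2).index.tolist()
print(top, top_alt)
```

With the zeroing approach a seen movie can still appear in the top 5 if fewer than five unseen movies have a positive prediction; the mask variant avoids that case.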