I have to compute the centroid for user ratings. My data is stored in a matrix that looks like this (imagine we have 4 users and 12 ratings):
[[0,1,0,-1,0,2,3,4,1,0,0,0],
[0,1,1,-1,0,2,3,4,1,0,2,0],
[0,1,0,0,-1,2,3,4,1,0,0,0],
[0,1,-1,2,0,2,3,4,1,4,-1,-1]]
My problem is that I'm not sure what to do with the unknown data, that is, when a user did not rate everything (those values are initialized to -1 in my example). A rating of 0 means that the user did not like the object at all, and 4 means that they loved it. When computing the centroid, what should I do with the values equal to -1? Right now, my code in Python looks like this:
def calc_centroid(ratMatrix):
    centroid = [0 for x in range(len(ratMatrix[0]))]
    for i in range(len(ratMatrix)):
        for j in range(len(ratMatrix[i])):
            centroid[j] = centroid[j] + ratMatrix[i][j]
    count = len(ratMatrix[0])
    for i in range(len(centroid)):
        centroid[i] = centroid[i]*1.0/count
    return centroid
However, this does not take into account that "centroid" is computed with the -1 values included in the sums, and I guess that this is not fully correct. What's the standard way of doing this?
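I am assuming that by centroid you mean the arithmetic mean of each column. For the second object, all 4 users gave a rating of 1, yet your code returns 0.33, because it divides the column sum (4) by len(ratMatrix[0]), which is the number of objects (12), rather than by the number of ratings. I think it should be 1.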
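One way to get that result in plain Python, assuming -1 always marks a missing rating, is to keep a per-column count of the ratings actually given and divide each column's sum by its own count. This is only a sketch: the name calc_centroid_skip_missing, the missing parameter, and the choice to return NaN for an object nobody rated are my own assumptions.

```python
def calc_centroid_skip_missing(rat_matrix, missing=-1):
    """Per-column mean that ignores entries equal to `missing` (hypothetical helper)."""
    n_cols = len(rat_matrix[0])
    totals = [0.0] * n_cols   # sum of real ratings per column
    counts = [0] * n_cols     # number of real ratings per column
    for row in rat_matrix:
        for j, rating in enumerate(row):
            if rating != missing:
                totals[j] += rating
                counts[j] += 1
    # A column nobody rated has no defined mean; NaN is one possible choice.
    return [t / c if c else float("nan") for t, c in zip(totals, counts)]


# First four columns of the example matrix.
sample = [[0, 1, 0, -1],
          [0, 1, 1, -1],
          [0, 1, 0, 0],
          [0, 1, -1, 2]]
sample_centroid = calc_centroid_skip_missing(sample)
print(sample_centroid)
```

Here the second column (four ratings of 1) averages to 1.0, and the fourth column averages only the two ratings that exist.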
NumPy can do a few things that make this neater.
import numpy as np

# Original version from the question, kept for comparison.
def calc_centroid(ratMatrix):
    centroid = [0 for x in range(len(ratMatrix[0]))]
    for i in range(len(ratMatrix)):
        for j in range(len(ratMatrix[i])):
            centroid[j] = centroid[j] + ratMatrix[i][j]
    count = len(ratMatrix[0])  # number of columns (12), not number of ratings
    for i in range(len(centroid)):
        centroid[i] = centroid[i]*1.0/count
    return centroid

def calc_centroid2(ratMatrix):
    mean_ratings = []
    for i in range(ratMatrix.shape[1]):  # iterate columns
        col = ratMatrix[:, i]
        col = col[col != -1]  # exclude unrated
        mean_ratings.append(np.mean(col))
    return mean_ratings

# 4 users, 12 objects to rate: want the mean rating for each object.
ratMatrix = np.array([[0, 1,  0, -1,  0, 2, 3, 4, 1, 0,  0,  0],
                      [0, 1,  1, -1,  0, 2, 3, 4, 1, 0,  2,  0],
                      [0, 1,  0,  0, -1, 2, 3, 4, 1, 0,  0,  0],
                      [0, 1, -1,  2,  0, 2, 3, 4, 1, 4, -1, -1]])
print(ratMatrix)
centroids = calc_centroid(ratMatrix)
print(['{:.2f} '.format(i) for i in centroids])
centroids = calc_centroid2(ratMatrix)
print(['{:.2f} '.format(i) for i in centroids])
This prints the matrix, followed by the result of calc_centroid and then the result of calc_centroid2:
[[ 0  1  0 -1  0  2  3  4  1  0  0  0]
 [ 0  1  1 -1  0  2  3  4  1  0  2  0]
 [ 0  1  0  0 -1  2  3  4  1  0  0  0]
 [ 0  1 -1  2  0  2  3  4  1  4 -1 -1]]
['0.00 ', '0.33 ', '0.00 ', '0.00 ', '-0.08 ', '0.67 ', '1.00 ', '1.33 ', '0.33 ', '0.33 ', '0.08 ', '-0.08 ']
['0.00 ', '1.00 ', '0.33 ', '1.00 ', '0.00 ', '2.00 ', '3.00 ', '4.00 ', '1.00 ', '1.00 ', '0.67 ', '0.00 ']
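The per-column loop can also be replaced by whole-array operations. A minimal sketch, assuming -1 is the only missing-value marker; the name calc_centroid_vectorized, the missing parameter, and the decision to return NaN for an object nobody rated are my own assumptions rather than part of the code above:

```python
import numpy as np


def calc_centroid_vectorized(rat_matrix, missing=-1):
    """Column-wise mean over entries that are not `missing` (hypothetical helper)."""
    ratings = np.asarray(rat_matrix, dtype=float)
    valid = ratings != missing                        # True where a rating exists
    counts = valid.sum(axis=0)                        # ratings given per object
    sums = np.where(valid, ratings, 0.0).sum(axis=0)  # missing entries add 0
    # Start from NaN so objects with no ratings stay NaN instead of dividing by 0.
    means = np.full(ratings.shape[1], np.nan)
    np.divide(sums, counts, out=means, where=counts > 0)
    return means


rat_matrix = np.array([[0, 1,  0, -1,  0, 2, 3, 4, 1, 0,  0,  0],
                       [0, 1,  1, -1,  0, 2, 3, 4, 1, 0,  2,  0],
                       [0, 1,  0,  0, -1, 2, 3, 4, 1, 0,  0,  0],
                       [0, 1, -1,  2,  0, 2, 3, 4, 1, 4, -1, -1]])
vectorized_centroid = calc_centroid_vectorized(rat_matrix)
print(np.round(vectorized_centroid, 2))
```

Rounded to two decimals, this should agree with the calc_centroid2 values shown above. Counting valid entries explicitly also makes it easy to see how many ratings each mean is based on.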