python k-means centroid

What is the standard way of computing centroids when you have unknown data?


I have to compute the centroid for user ratings. My data is stored in a matrix that looks like this (imagine we have 4 users and 12 ratings):

[[0,1,0,-1,0,2,3,4,1,0,0,0],
[0,1,1,-1,0,2,3,4,1,0,2,0],
[0,1,0,0,-1,2,3,4,1,0,0,0],
[0,1,-1,2,0,2,3,4,1,4,-1,-1]]

My problem is that I'm not sure what to do with the unknown data, that is, the objects a user did not rate (initialized to -1 in my example). A rating of 0 means that the user did not like the object at all, and 4 means that they loved it. When computing the centroid, what should I do with the values equal to -1? Right now, my Python code looks like this:

def calc_centroid(ratMatrix):
  centroid = [0 for x in range(len(ratMatrix[0]))] 
  for i in range(len(ratMatrix)): 
    for j in range(len(ratMatrix[i])):
      centroid[j] = centroid[j] + ratMatrix[i][j]
  count = len(ratMatrix[0])
  for i in range(len(centroid)):
    centroid[i] = centroid[i]*1.0/count;
  return centroid

However, this computes the centroid with the -1 values included, and I guess that this is not fully correct. What's the standard way of doing this?


Solution

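  • I am assuming that the centroid is the arithmetic mean of each column. With 4 ratings of 1, your code returns 0.33, but I think it should be 1. The cause is that count is set to len(ratMatrix[0]), the number of objects (12), rather than len(ratMatrix), the number of users (4), so every column sum is divided by 12.

    NumPy can do a few things that make this neater. In the code that follows, calc_centroid is your original function, kept for comparison, and calc_centroid2 drops the -1 entries from each column before averaging.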

    import numpy as np
    
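    def calc_centroid(ratMatrix):
      # Original function from the question, behavior unchanged, kept for comparison.
      centroid = [0 for x in range(len(ratMatrix[0]))]
      for i in range(len(ratMatrix)):
        for j in range(len(ratMatrix[i])):
          centroid[j] = centroid[j] + ratMatrix[i][j]  # -1 entries are summed too
      count = len(ratMatrix[0])  # number of columns (12), not number of users (4)
      for i in range(len(centroid)):
        centroid[i] = centroid[i]*1.0/count
      return centroid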
    
    def calc_centroid2(ratMatrix):
        mean_ratings = []
        for i in range(ratMatrix.shape[1]): # iterate columns
            col = ratMatrix[:,i]
            col = col[col != -1] #exclude unrated
            mean_ratings.append(np.mean(col))
        return mean_ratings
    
    # 4 users, 12 objects to rate: want the mean rating for each object.
    ratMatrix = np.array([[0,1,0 ,-1,0 ,2,3,4,1,0 ,0, 0],
                          [0,1,1 ,-1,0 ,2,3,4,1,0 ,2, 0],
                          [0,1,0 ,0 ,-1,2,3,4,1,0 ,0, 0],
                          [0,1,-1,2 ,0 ,2,3,4,1,4,-1,-1]])
    
    print(ratMatrix)
    
    centroids = calc_centroid(ratMatrix)
    print(['{:.2f} '.format(i) for i in centroids])
    
    centroids = calc_centroid2(ratMatrix)
    print(['{:.2f} '.format(i) for i in centroids])
    

    This prints the matrix, then the result of the original calc_centroid, then the result of calc_centroid2:

    [[ 0  1  0 -1  0  2  3  4  1  0  0  0]
     [ 0  1  1 -1  0  2  3  4  1  0  2  0]
     [ 0  1  0  0 -1  2  3  4  1  0  0  0]
     [ 0  1 -1  2  0  2  3  4  1  4 -1 -1]]
    ['0.00 ', '0.33 ', '0.00 ', '0.00 ', '-0.08 ', '0.67 ', '1.00 ', '1.33 ', '0.33 ', '0.33 ', '0.08 ', '-0.08 ']
    ['0.00 ', '1.00 ', '0.33 ', '1.00 ', '0.00 ', '2.00 ', '3.00 ', '4.00 ', '1.00 ', '1.00 ', '0.67 ', '0.00 ']
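
The per-column loop in calc_centroid2 could also be written with whole-array operations. The sketch below is not part of the original answer: the function name calc_centroid_vectorized and its missing parameter are made up for illustration, and it assumes -1 never occurs as a real rating. It returns NaN for an object that nobody rated, a case in which np.mean on an empty column would otherwise emit a warning.

```python
import numpy as np

def calc_centroid_vectorized(ratings, missing=-1):
    """Mean rating per column, ignoring entries equal to `missing`.

    Hypothetical helper for illustration; assumes `missing` is never a real rating.
    """
    ratings = np.asarray(ratings)
    rated = ratings != missing                      # True where a rating exists
    sums = np.where(rated, ratings, 0).sum(axis=0)  # per-object sum of known ratings
    counts = rated.sum(axis=0)                      # per-object count of known ratings
    means = np.full(ratings.shape[1], np.nan)       # NaN marks objects nobody rated
    np.divide(sums, counts, out=means, where=counts > 0)
    return means

# Same sample data as above: 4 users, 12 objects.
ratMatrix = np.array([[0, 1,  0, -1,  0, 2, 3, 4, 1, 0,  0,  0],
                      [0, 1,  1, -1,  0, 2, 3, 4, 1, 0,  2,  0],
                      [0, 1,  0,  0, -1, 2, 3, 4, 1, 0,  0,  0],
                      [0, 1, -1,  2,  0, 2, 3, 4, 1, 4, -1, -1]])

centroid = calc_centroid_vectorized(ratMatrix)
print(np.round(centroid, 2))
```

On the sample matrix this should agree with the calc_centroid2 output. Whether to skip unrated entries or to fill them in some other way (for example with a user's own average) is a modeling choice; skipping them is simply the approach the answer takes.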