I am studying Recommender systems and part of the calculation is to add weight as a number of co-rated items to the distance measure. Consider the matrix below:
Rows represents the users, columns the items. So for example in the first row, this user gave 2 to the product A, did not evaluate product B. Then product C got 3, product D 2 and so on...
ratings = np.array([[2,0,3,2,0,1],
[2,5,3,0,0,2],
[4,3,0,0,0,5],
[3,0,2,4,0,5]])
Now, if two users have less than 3 co-rated items, divide the total number of co-rated items by 3 and then multiply it by distance measure (this step is not included though).
Below is how I create those weights, which works:
#Is False when rating is 0 because I want to exclude those
ratings_aux = ratings > 0
co_rated_aux = np.zeros((ratings.shape[0], ratings.shape[0]))
co_rated = co_rated_aux.copy()
for row in xrange(co_rated.shape[0]):
for clmn in xrange(co_rated.shape[1]):
#If both are True, sum them
co_rated_aux[row,clmn] = np.sum((ratings_aux[row,:] == ratings_aux[clmn,:]) & (ratings_aux[row,:] != False) & (ratings_aux[clmn,:] != False))
if co_rated_aux[row,clmn] <= 3:
co_rated[row,clmn] = co_rated_aux[row,clmn]/float(3)
else:
co_rated[row,clmn] = 1
Here you can see the number of co-rated items: (for example, first and second user have rated three same items)
print co_rated_aux
[[ 4. 3. 2. 4.]
[ 3. 4. 3. 3.]
[ 2. 3. 3. 2.]
[ 4. 3. 2. 4.]]
And the final weights: (if the two users have more than 3 co-rated items, similarity measure will remain unchanged. If less, then the measure will be decreased)
print co_rated
[[ 1. 1. 0.66666667 1. ]
[ 1. 1. 1. 1. ]
[ 0.66666667 1. 1. 0.66666667]
[ 1. 1. 0.66666667 1. ]]
However, this calculation is quite ugly and will be very slow with bigger arrays. I was trying to get rid of the for loops using only vector operations, but I really do not know how.
I will be happy for any suggestions.
One vectorized approach would be with broadcasting
(watch out for memory usage though, but being booelan arrays would be comparatively less heavy on RAM) -
mask = ratings_aux[None,:,:] == ratings_aux[:,None,:]
mask &= (ratings_aux != False)[None,:,:] & (ratings_aux != False)[:,None,:]
co_rated_aux_out = mask.sum(2) # or np.count_nonzero(mask,axis=2)
co_rated_out = np.where(co_rated_aux_out <=3, co_rated_aux_out/3.0,1)
Basically with the first two steps, we are keeping the last axis aligned and "spreading-out" the first axis from the input array against itself for element-wise comparisons. This gives us a 3D array at each of those two steps. This "spreading-out" is in accordance with rules of broadcasting
after we have introduced singleton
dimensions/axes with np.newaxis/None
in each of those two versions of the input array.
A closer look reveals that, since we are already checking for equality at the first step, we can skip the second item in the second step, like so -
mask &= (ratings_aux != False)[None,:,:]
Again, ratings_aux
is a boolean array. So, ratings_aux != False
would be essentially the true elements, that's the array itself. So, further improvement there -
mask &= ratings_aux[None,:,:]