Search code examples
pythonnumpyvectorizationrecommendation-engine

Vectorized calculation of co-rated items in user-item matrix for Rec. System


I am studying Recommender systems and part of the calculation is to add weight as a number of co-rated items to the distance measure. Consider the matrix below:

Rows represents the users, columns the items. So for example in the first row, this user gave 2 to the product A, did not evaluate product B. Then product C got 3, product D 2 and so on...

ratings = np.array([[2,0,3,2,0,1],
                    [2,5,3,0,0,2],
                    [4,3,0,0,0,5],
                    [3,0,2,4,0,5]])

Now, if two users have less than 3 co-rated items, divide the total number of co-rated items by 3 and then multiply it by distance measure (this step is not included though).

Below is how I create those weights, which works:

#Is False when rating is 0 because I want to exclude those
ratings_aux = ratings > 0

co_rated_aux = np.zeros((ratings.shape[0], ratings.shape[0]))
co_rated = co_rated_aux.copy()

for row in xrange(co_rated.shape[0]):
    for clmn in xrange(co_rated.shape[1]):
        #If both are True, sum them
        co_rated_aux[row,clmn] = np.sum((ratings_aux[row,:] == ratings_aux[clmn,:]) & (ratings_aux[row,:] != False) & (ratings_aux[clmn,:] != False))
        if co_rated_aux[row,clmn] <= 3:
            co_rated[row,clmn] = co_rated_aux[row,clmn]/float(3)
        else:
            co_rated[row,clmn] = 1

Here you can see the number of co-rated items: (for example, first and second user have rated three same items)

print co_rated_aux

[[ 4.  3.  2.  4.]
 [ 3.  4.  3.  3.]
 [ 2.  3.  3.  2.]
 [ 4.  3.  2.  4.]]

And the final weights: (if the two users have more than 3 co-rated items, similarity measure will remain unchanged. If less, then the measure will be decreased)

print co_rated

[[ 1.          1.          0.66666667  1.        ]
 [ 1.          1.          1.          1.        ]
 [ 0.66666667  1.          1.          0.66666667]
 [ 1.          1.          0.66666667  1.        ]]

However, this calculation is quite ugly and will be very slow with bigger arrays. I was trying to get rid of the for loops using only vector operations, but I really do not know how.

I will be happy for any suggestions.


Solution

  • One vectorized approach would be with broadcasting (watch out for memory usage though, but being booelan arrays would be comparatively less heavy on RAM) -

    mask = ratings_aux[None,:,:] == ratings_aux[:,None,:]
    mask &= (ratings_aux != False)[None,:,:] & (ratings_aux != False)[:,None,:]
    co_rated_aux_out = mask.sum(2) # or np.count_nonzero(mask,axis=2)
    co_rated_out = np.where(co_rated_aux_out <=3, co_rated_aux_out/3.0,1)
    

    Basically with the first two steps, we are keeping the last axis aligned and "spreading-out" the first axis from the input array against itself for element-wise comparisons. This gives us a 3D array at each of those two steps. This "spreading-out" is in accordance with rules of broadcasting after we have introduced singleton dimensions/axes with np.newaxis/None in each of those two versions of the input array.

    A closer look reveals that, since we are already checking for equality at the first step, we can skip the second item in the second step, like so -

    mask &= (ratings_aux != False)[None,:,:]
    

    Again, ratings_aux is a boolean array. So, ratings_aux != False would be essentially the true elements, that's the array itself. So, further improvement there -

    mask &= ratings_aux[None,:,:]