python numpy vectorization recommendation-engine

Vectorized calculation of co-rated items in user-item matrix for Rec. System

I am studying Recommender systems and part of the calculation is to add weight as a number of co-rated items to the distance measure. Consider the matrix below:

Rows represents the users, columns the items. So for example in the first row, this user gave 2 to the product A, did not evaluate product B. Then product C got 3, product D 2 and so on...

ratings = np.array([[2,0,3,2,0,1],
                    [2,5,3,0,0,2],
                    [4,3,0,0,0,5],
                    [3,0,2,4,0,5]])

Now, if two users have less than 3 co-rated items, divide the total number of co-rated items by 3 and then multiply it by distance measure (this step is not included though).

Below is how I create those weights, which works:

#Is False when rating is 0 because I want to exclude those
ratings_aux = ratings > 0

co_rated_aux = np.zeros((ratings.shape[0], ratings.shape[0]))
co_rated = co_rated_aux.copy()

for row in xrange(co_rated.shape[0]):
    for clmn in xrange(co_rated.shape[1]):
        #If both are True, sum them
        co_rated_aux[row,clmn] = np.sum((ratings_aux[row,:] == ratings_aux[clmn,:]) & (ratings_aux[row,:] != False) & (ratings_aux[clmn,:] != False))
        if co_rated_aux[row,clmn] <= 3:
            co_rated[row,clmn] = co_rated_aux[row,clmn]/float(3)
        else:
            co_rated[row,clmn] = 1

Here you can see the number of co-rated items: (for example, first and second user have rated three same items)

print co_rated_aux

[[ 4.  3.  2.  4.]
 [ 3.  4.  3.  3.]
 [ 2.  3.  3.  2.]
 [ 4.  3.  2.  4.]]

And the final weights: (if the two users have more than 3 co-rated items, similarity measure will remain unchanged. If less, then the measure will be decreased)

print co_rated

[[ 1.          1.          0.66666667  1.        ]
 [ 1.          1.          1.          1.        ]
 [ 0.66666667  1.          1.          0.66666667]
 [ 1.          1.          0.66666667  1.        ]]

However, this calculation is quite ugly and will be very slow with bigger arrays. I was trying to get rid of the for loops using only vector operations, but I really do not know how.

I will be happy for any suggestions.

Solution

One vectorized approach would be with broadcasting (watch out for memory usage though, but being booelan arrays would be comparatively less heavy on RAM) -

mask = ratings_aux[None,:,:] == ratings_aux[:,None,:]
mask &= (ratings_aux != False)[None,:,:] & (ratings_aux != False)[:,None,:]
co_rated_aux_out = mask.sum(2) # or np.count_nonzero(mask,axis=2)
co_rated_out = np.where(co_rated_aux_out <=3, co_rated_aux_out/3.0,1)

Basically with the first two steps, we are keeping the last axis aligned and "spreading-out" the first axis from the input array against itself for element-wise comparisons. This gives us a 3D array at each of those two steps. This "spreading-out" is in accordance with rules of broadcasting after we have introduced singleton dimensions/axes with np.newaxis/None in each of those two versions of the input array.

A closer look reveals that, since we are already checking for equality at the first step, we can skip the second item in the second step, like so -

mask &= (ratings_aux != False)[None,:,:]

Again, ratings_aux is a boolean array. So, ratings_aux != False would be essentially the true elements, that's the array itself. So, further improvement there -

mask &= ratings_aux[None,:,:]