Tags: python, numpy, matrix, scipy, sparse-matrix

Repeat a csr_matrix row over axis 0 to create a matrix


I have a CSR-formatted sparse matrix (scipy.sparse.csr_matrix) with around 100,000 rows and 10,000 columns. The rows represent users, the columns represent items, and each value is the rating that user gave that item.

I am trying to calculate the correlation between pairs of users, so I loop over each user (call it user_a) and use matrix operations to get the correlation of user_a against all other users.

The first step is to generate the current user matrix. This matrix contains user_a's ratings repeated on every row, masked so that each row keeps only the items that both user_a and that row's user have rated.

My code at the moment is:

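import numpy
from scipy import sparse

# ratings is the big original matrix
R = ratings.getrow(user_id)
user_matrix = sparse.csr_matrix(R)
# Repeat the single row once per user
user_matrix = user_matrix[numpy.array([0]).repeat(ratings.shape[0]),:]
# Keep only the entries that each other user has also rated
# (the builtin bool replaces the numpy.bool alias, which some NumPy versions reject)
user_matrix = user_matrix.multiply(ratings.astype(bool))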

(https://stackoverflow.com/a/25342156/947194)

But these lines take 4 seconds for a user with just 500 items, and I need to run them for each user (100,000 times), so it is a bit slow.

I tried generating user_matrix using vstack, but it took 7 seconds.

Is there a way to reduce the running time of these lines further?


Solution

  • For a csr_matrix ratings and an integer user_id, this gives the same result as your code:

    valid_ratings = ratings.astype(bool)
    user_matrix = valid_ratings.multiply(ratings[user_id])
    

    But it won't work if your version of scipy is too old. I don't recall which version of scipy extended the broadcasting behavior of sparse matrices to make this work.
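
To check the equivalence on your own installation, here is a minimal sketch on a small made-up matrix (the values and `user_id = 0` are illustrative, not taken from the question). It compares the repeat-and-mask code from the question with the broadcast version above. It also includes a possible fallback for older scipy versions that scales the columns of the mask with a diagonal matrix; that fallback is my assumption about a workable substitute, not something from the original answer.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: 4 users x 5 items, 0 means "not rated".
ratings = sparse.csr_matrix(np.array([
    [5, 0, 3, 0, 1],
    [0, 2, 0, 0, 4],
    [1, 0, 0, 2, 0],
    [0, 0, 4, 0, 5],
], dtype=np.float64))
user_id = 0

# Approach from the question: repeat the user's row, then mask it.
row = ratings.getrow(user_id)
repeated = row[np.array([0]).repeat(ratings.shape[0]), :]
expected = repeated.multiply(ratings.astype(bool))

# Approach from the answer: broadcast the single row against the mask.
valid_ratings = ratings.astype(bool)
user_matrix = valid_ratings.multiply(ratings[user_id])

# Assumed fallback without broadcasting: put the user's row on a diagonal
# and right-multiply, which scales column j of the mask by the user's rating j.
scale = sparse.diags(ratings[user_id].toarray().ravel())
user_matrix_diag = valid_ratings.astype(ratings.dtype).dot(scale)

# Both comparisons should report that the results match.
print(np.array_equal(user_matrix.toarray(), expected.toarray()))
print(np.array_equal(user_matrix_diag.toarray(), expected.toarray()))
```

Comparing through `toarray()` is only sensible at this toy size; on the full 100,000 x 10,000 matrix you would want to keep everything sparse and time the variants on a handful of users instead.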