python, scipy, sparse-matrix, cosine-similarity

Calculate cosine similarity without summing duplicates when converting coo_matrix to csr_matrix


I want to calculate the cosine similarity between products in a purchases dataset. I have more than 100,000 rows (one row = one user purchase event) and more than 80,000 products.

I want to avoid using pandas.crosstab to put the dataset into the following format:

>  user_id\item_id  1  2  3  4  ...   
>       1         | 1  1  0  0
>       2         | 0  1  0  0
>       3         | 1  0  1  0
>       4         | 0  0  0  0
>       5         | 0  0  1  0
>       ...
> 
> Matrix: Whether a user purchased an item or not

I converted the purchases dataset to a scipy.sparse.coo_matrix and assumed I would then call tocsr() to do the normalization and the similarity calculations between products. However, I found that converting a coo_matrix to a csr_matrix sums duplicate entries. I don't want that to happen: the matrix should contain only 1s and 0s.
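The behavior described above can be reproduced with a minimal sketch. The purchase events here are made up for illustration (user 0 buys item 1 twice); the array names `users`, `items`, and `data` are assumptions, not part of the original dataset.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical purchase events: (user, item) pairs, with one repeated purchase.
users = np.array([0, 0, 1])
items = np.array([1, 1, 2])
data = np.ones(len(users), dtype=np.int64)

# COO keeps both (0, 1) entries as separate stored values.
coo = coo_matrix((data, (users, items)), shape=(2, 3))

# Converting to CSR merges the duplicate coordinates by summing them,
# so the repeated purchase becomes a count rather than a 0/1 flag.
csr = coo.tocsr()
print(csr.toarray())
```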

Is there a way to get around that and calculate cosine similarity?


Solution

  • Because a csr_matrix supports item assignment with a boolean mask, you can use the following one-liner to set every value greater than one to one:

    X[X > 1] = 1
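To show how this fits into the full calculation, here is a rough sketch that binarizes the matrix with the one-liner above and then computes item-item cosine similarity using only SciPy and NumPy. The sample events and the names `users`, `items`, `col_norms`, `X_norm`, and `sim` are illustrative assumptions. Scaling each column to unit length and then multiplying is one common way to get cosine similarity, not necessarily the only or the best one for a dataset of this size.

```python
import numpy as np
from scipy.sparse import coo_matrix, diags

# Hypothetical purchase events; user 0 bought item 0 twice.
users = np.array([0, 0, 0, 1, 2, 2])
items = np.array([0, 0, 1, 1, 0, 2])
data = np.ones(len(users))

# Rows are users, columns are items; tocsr() sums the duplicate (0, 0) entry.
X = coo_matrix((data, (users, items)), shape=(3, 3)).tocsr()

# Clip the summed counts back to a 0/1 purchase indicator.
X[X > 1] = 1

# Euclidean norm of each item column; guard against items nobody bought.
col_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=0)).ravel())
col_norms[col_norms == 0] = 1.0

# Scale every column to unit length, then item-item cosine similarity
# is the product of the normalized matrix with itself.
X_norm = X @ diags(1.0 / col_norms)
sim = X_norm.T @ X_norm

print(sim.toarray())
```

The result `sim` stays sparse, which matters with tens of thousands of products: a dense 80,000 x 80,000 similarity matrix would need tens of gigabytes of memory.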