I want to calculate the cosine similarity between the products in a purchases dataset. I have more than 100,000 rows (one row = one user purchase event) and more than 80,000 products.
To avoid using `pandas.crosstab` to put the dataset in the format:
> user_id\item_id 1 2 3 4 ...
> 1 | 1 1 0 0
> 2 | 0 1 0 0
> 3 | 1 0 1 0
> 4 | 0 0 0 0
> 5 | 0 0 1 0
> ...
>
> Matrix: Whether a user purchased an item or not
I converted the purchases dataset to a `scipy.sparse.coo_matrix` and thought I would then call `tocsr()` to do the normalization and similarity calculations between products. However, I found out that converting a `coo_matrix` to a `csr_matrix` sums the duplicate entries, which I don't want: I only want 1 and 0 in my matrix.
Is there a way to get around that and calculate cosine similarity?
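For reference, a minimal reproduction of the summing behaviour, using made-up indices (the names `user_idx`, `item_idx` and `purchases` are only for illustration, not my real data):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical purchase events: user 0 bought item 1 twice.
user_idx = np.array([0, 0, 0, 1, 2])
item_idx = np.array([0, 1, 1, 1, 2])
data = np.ones(len(user_idx), dtype=np.int64)

# COO keeps both (0, 1) entries as separate stored values.
purchases = coo_matrix((data, (user_idx, item_idx)), shape=(3, 3))

# Converting to CSR merges the duplicate (row, col) pairs by summing them.
X = purchases.tocsr()

dense = X.toarray()
print(dense)  # the (0, 1) cell holds a count, not a 0/1 flag
```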
Since a `csr_matrix` supports item assignment through boolean indexing, you can use the following one-liner to set every value larger than one to one:
X[X > 1] = 1
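To show how this could fit into the whole computation, here is a rough sketch that clips the summed duplicates and then computes item-to-item cosine similarity while staying sparse. The toy indices and the names `user_idx`, `item_idx`, `col_norms`, `X_norm` and `item_sim` are assumptions for illustration, and the manual column normalization is just one possible way to do it.

```python
import numpy as np
from scipy.sparse import coo_matrix, diags

# Hypothetical purchase events (user 0 bought item 1 twice).
user_idx = np.array([0, 0, 0, 1, 2, 2])
item_idx = np.array([0, 1, 1, 1, 0, 2])
data = np.ones(len(user_idx), dtype=np.float64)

# tocsr() sums the duplicate (user, item) pairs.
X = coo_matrix((data, (user_idx, item_idx)), shape=(3, 3)).tocsr()

# Clip the summed counts back to a 0/1 purchase indicator.
X[X > 1] = 1

# L2 norm of every item column; guard against items nobody bought.
col_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=0)).ravel())
col_norms[col_norms == 0] = 1.0

# Scale each column to unit length, then take dot products between columns.
X_norm = X @ diags(1.0 / col_norms)
item_sim = X_norm.T @ X_norm  # sparse items x items matrix of cosine similarities

print(item_sim.toarray())
```

With 80,000 products the similarity matrix is only manageable as long as it stays sparse, so calling `toarray()` as above is only sensible on small examples like this one.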