I am building a recommendation system using an open source library, LightFM. This library requires certain pieces of data to be in a sparse matrix format, specifically the scipy coo_matrix. It is here that I am encountering strange behavior. It seems like a bug, but it's more likely that I am doing something wrong.
Basically, I let LightFM.Dataset build me a sparse matrix, like so:
interactions, weights = dataset.build_interactions(data=_get_interactions_data())
The method, build_interactions, returns "Two COO matrices: the interactions matrix and the corresponding weights matrix" -- LightFM Official Doc.
When I inspect the contents of this sparse matrix (in practice, I use a debugger), like so:
for i in interactions.data:
    print(i, end=', ')
1, 1, 1, 1, 1, ....
It prints a long list of 1s, which indicates that the sparse matrix's nonzero elements are only 1s.
However, if I first call max() on the sparse matrix, it reports that the maximum value is 3, not 1. Furthermore, printing the matrix's data after that call prints a long list of 1s, 2s, and 3s. This is the code for that:
print(interactions.max())
for i in interactions.data:
    print(i, end=', ')
3
1, 1, 3, 2, 1, 2, ...
Any idea what is going on here? I am using Python 3.6.8 and SciPy 1.5.4 on CentOS 7.
Thank you.
A 'raw' coo_matrix can have duplicate entries (repeats of the same row and col values), but when it is converted to csr format for calculations those duplicates are summed. max() must be doing the same, but in place, in order to find the maximum.
In [9]: from scipy import sparse
In [10]: M = sparse.coo_matrix(([1,1,1,1,1,1],([0,0,0,0,0,0],[0,0,1,0,1,2])))
In [11]: M.data
Out[11]: array([1, 1, 1, 1, 1, 1])
In [12]: M.max()
Out[12]: 3
In [13]: M.data
Out[13]: array([3, 2, 1])
In [14]: M
Out[14]:
<1x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in COOrdinate format>
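To make the csr point concrete, here is a small sketch using the same toy matrix. It assumes only scipy, and the variable names (C, dense) are mine. Converting with tocsr() or toarray() gives the summed values, while the data of the original coo object stays as it was, because the summing happens on the converted result rather than in place.

```python
from scipy import sparse

# Same toy matrix as above: six stored 1s, with repeated (row, col) pairs.
M = sparse.coo_matrix(([1, 1, 1, 1, 1, 1],
                       ([0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 1, 2])))

# Converting to csr sums the duplicates in the new matrix.
C = M.tocsr()
print(C.toarray())   # [[3 2 1]]
print(C.nnz)         # 3 stored elements after summing

# A dense conversion also adds the duplicates together.
dense = M.toarray()
print(dense)         # [[3 2 1]]

# The original coo matrix still holds the six raw entries.
print(M.data)        # [1 1 1 1 1 1]
```

So the duplicates are always "there" in the coo matrix; they only become visible as summed values once something forces the canonical form.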
Tracing through the max code, I find that it uses sum_duplicates:
In [33]: M = sparse.coo_matrix(([1,1,1,1,1,1],([0,0,0,0,0,0],[0,0,1,0,1,2])))
In [34]: M.data
Out[34]: array([1, 1, 1, 1, 1, 1])
In [35]: M.sum_duplicates?
Signature: M.sum_duplicates()
Docstring:
Eliminate duplicate matrix entries by adding them together
This is an *in place* operation
File: /usr/local/lib/python3.8/dist-packages/scipy/sparse/coo.py
Type: method
In [36]: M.sum_duplicates()
In [37]: M.data
Out[37]: array([3, 2, 1])
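If you want to inspect the summed values without changing the matrix you were handed, one option is to work on a copy. This is only a sketch on the toy matrix from above, not anything LightFM-specific; the names summed, peak, n_unique and has_duplicates are made up for illustration. It also shows one way to check whether a coo matrix has duplicate coordinates at all, by comparing the number of stored entries with the number of distinct (row, col) pairs.

```python
import numpy as np
from scipy import sparse

# Toy matrix with repeated (row, col) pairs, as in the session above.
M = sparse.coo_matrix(([1, 1, 1, 1, 1, 1],
                       ([0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 1, 2])))

# Work on a copy so the in-place summing never touches M itself.
summed = M.copy()
summed.sum_duplicates()
peak = summed.max()
print(peak)           # 3
print(summed.data)    # [3 2 1]
print(M.data)         # [1 1 1 1 1 1], the original is unchanged

# Detect duplicates: fewer distinct coordinates than stored entries.
coords = np.stack([M.row, M.col], axis=1)
n_unique = len(np.unique(coords, axis=0))
has_duplicates = n_unique < M.nnz
print(n_unique, M.nnz, has_duplicates)   # 3 6 True
```

Whether you want the raw or the summed values depends on what the entries mean; here the summed values are per-coordinate totals of the repeated entries.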