python scipy sparse-matrix

Scipy coo_matrix.max() alters data attribute


I am building a recommendation system using an open source library, LightFM. This library requires certain pieces of data to be in a sparse matrix format, specifically the scipy coo_matrix. It is here that I am encountering strange behavior. It seems like a bug, but it's more likely that I am doing something wrong.

Basically, I let LightFM.Dataset build me a sparse matrix, like so:

interactions, weights = dataset.build_interactions(data=_get_interactions_data())

The method, build_interactions, returns "Two COO matrices: the interactions matrix and the corresponding weights matrix" -- LightFM Official Doc.

When I inspect the contents of this sparse matrix (in practice, I use a debugger), like so:

for i in interactions.data:
    print(i, end=', ')

1, 1, 1, 1, 1, ....

It prints a long list of 1s, which indicates that the sparse matrix's nonzero elements are only 1s.

However, if I first check the max of the sparse matrix, it reports that the maximum value is not 1 but 3. Furthermore, printing the matrix's data after that check prints a long list of 1s, 2s, and 3s. This is the code for that:

print(interactions.max())
for i in interactions.data:
    print(i, end=', ')

3
1, 1, 3, 2, 1, 2, ...

Any idea what is going on here? Python is 3.6.8. Scipy is 1.5.4. CentOS7.

Thank you.


Solution

  • A 'raw' coo_matrix can hold duplicate elements (repeated entries with the same row and col values). When the matrix is converted to csr format for calculations, those duplicates are summed. max() must be doing the same summation, but in place on the coo_matrix itself, in order to find the maximum.

    In [9]: from scipy import sparse
    In [10]: M = sparse.coo_matrix(([1,1,1,1,1,1],([0,0,0,0,0,0],[0,0,1,0,1,2])))
    In [11]: M.data
    Out[11]: array([1, 1, 1, 1, 1, 1])
    In [12]: M.max()
    Out[12]: 3
    In [13]: M.data
    Out[13]: array([3, 2, 1])
    In [14]: M
    Out[14]: 
    <1x3 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in COOrdinate format>
    

    Tracing through the max code, I find that it calls sum_duplicates, which is documented as an in-place operation:

    In [33]: M = sparse.coo_matrix(([1,1,1,1,1,1],([0,0,0,0,0,0],[0,0,1,0,1,2])))
    In [34]: M.data
    Out[34]: array([1, 1, 1, 1, 1, 1])
    In [35]: M.sum_duplicates?
    Signature: M.sum_duplicates()
    Docstring:
    Eliminate duplicate matrix entries by adding them together
    
    This is an *in place* operation
    File:      /usr/local/lib/python3.8/dist-packages/scipy/sparse/coo.py
    Type:      method
    In [36]: M.sum_duplicates()
    In [37]: M.data
    Out[37]: array([3, 2, 1])
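
If the original data array has to stay untouched, one option is to take the max on a copy (or on a CSR conversion) rather than on the COO matrix itself. The following is a minimal sketch using the same toy matrix as above; the variable names are illustrative, and it assumes that copy() and tocsr() return new objects that do not share the data array with M. It also counts how often each (row, col) pair occurs, which is where the 2s and 3s come from.

```python
import numpy as np
from scipy import sparse

# Same toy matrix as above: six stored 1s, several at repeated (row, col) positions.
data = [1, 1, 1, 1, 1, 1]
rows = [0, 0, 0, 0, 0, 0]
cols = [0, 0, 1, 0, 1, 2]
M = sparse.coo_matrix((data, (rows, cols)))

# Snapshot of the stored values, used only to check that M is not modified.
original = M.data.copy()

# Take the max on a copy, so any in-place duplicate summing hits the copy.
max_via_copy = M.copy().max()

# Alternative: convert to CSR first; the conversion sums duplicates in the new matrix.
max_via_csr = M.tocsr().max()

# Count how many stored entries share each (row, col) position.
positions = np.stack([M.row, M.col])
unique_positions, counts = np.unique(positions, axis=1, return_counts=True)

unchanged = np.array_equal(M.data, original)
print(max_via_copy, max_via_csr, counts.max(), unchanged)
```

In the question's setting this would mean calling interactions.copy().max() or interactions.tocsr().max() instead of interactions.max(). Whether the repeated interactions should be summed at all depends on how the data is meant to be interpreted.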