Search code examples
pythonnumpyscipysparse-matrixscientific-computing

What is the right SciPy sparse matrix format for incremental summation


In my code I am currently iterating and creating three lists:

data, row, col

There is a high repetition of (row, col) pairs, and in my final sparse matrix M I would like the value of M[row, col] to be the sum of all the corresponding elements in data. From reading the documentation, the coo_matrix format seems perfect and for small examples it works just fine.

The problem I am having is that when I scale up my problem size, it looks like the intermediate lists data, row, col are using up all of my (8gb) of memory and the swap space and my script gets automatically killed.

So my question is:

Is there an appropriate format or an efficient way of incrementally building my summed matrix so I don't have to store the full intermediate lists / numpy arrays?

My program loops over a grid, creating local_data, local_row, local_col lists at each point, the elements of which are then appended to data, row, col, so being able to update the sparse matrix with lists as per the sparse matrix constructors would be the ideal case.


Solution

  • There are two things that may be killing you: the duplicates or the overhead of a list over an array. In either case, probably the right thing to do is to grow your list only so big before dumping it into a coo_matrix and adding it to your total. I took a couple of timings:

    rows = list(np.random.randint(100, size=(10000,)))
    cols = list(np.random.randint(100, size=(10000,)))
    values = list(np.random.rand(10000))
    
    %timeit sps.coo_matrix((values, (rows, cols)))
    100 loops, best of 3: 4.03 ms per loop
    
    %timeit (sps.coo_matrix((values[:5000], (rows[:5000], cols[:5000]))) +
             sps.coo_matrix((values[5000:], (rows[5000:], cols[5000:]))))
    100 loops, best of 3: 5.24 ms per loop
    
    %timeit sps.coo_matrix((values[:5000], (rows[:5000], cols[:5000])))
    100 loops, best of 3: 2.16 ms per loop
    

    So there is about a 25% overhead in splitting the lists in two, converting each to a coo_matrix and then adding them together. And it doesn't seem to be as bad if you do more splits:

    %timeit (sps.coo_matrix((values[:2500], (rows[:2500], cols[:2500]))) +   
             sps.coo_matrix((values[2500:5000], (rows[2500:5000], cols[2500:5000]))) +  
             sps.coo_matrix((values[5000:7500], (rows[5000:7500], cols[5000:7500]))) + 
             sps.coo_matrix((values[7500:], (rows[7500:], cols[7500:]))))
    100 loops, best of 3: 5.76 ms per loop