In my code I am currently iterating and creating three lists:
data, row, col
There is a high repetition of (row, col)
pairs, and in my final sparse matrix M
I would like the value of M[row, col]
to be the sum of all the corresponding elements in data
. From reading the documentation, the coo_matrix
format seems perfect and for small examples it works just fine.
The problem I am having is that when I scale up my problem size, it looks like the intermediate lists data, row, col
are using up all of my (8gb) of memory and the swap space and my script gets automatically killed.
So my question is:
Is there an appropriate format or an efficient way of incrementally building my summed matrix so I don't have to store the full intermediate lists / numpy arrays?
My program loops over a grid, creating local_data, local_row, local_col
lists at each point, the elements of which are then appended to data, row, col
, so being able to update the sparse matrix with lists as per the sparse matrix constructors would be the ideal case.
There are two things that may be killing you: the duplicates or the overhead of a list over an array. In either case, probably the right thing to do is to grow your list only so big before dumping it into a coo_matrix
and adding it to your total. I took a couple of timings:
rows = list(np.random.randint(100, size=(10000,)))
cols = list(np.random.randint(100, size=(10000,)))
values = list(np.random.rand(10000))
%timeit sps.coo_matrix((values, (rows, cols)))
100 loops, best of 3: 4.03 ms per loop
%timeit (sps.coo_matrix((values[:5000], (rows[:5000], cols[:5000]))) +
sps.coo_matrix((values[5000:], (rows[5000:], cols[5000:]))))
100 loops, best of 3: 5.24 ms per loop
%timeit sps.coo_matrix((values[:5000], (rows[:5000], cols[:5000])))
100 loops, best of 3: 2.16 ms per loop
So there is about a 25% overhead in splitting the lists in two, converting each to a coo_matrix
and then adding them together. And it doesn't seem to be as bad if you do more splits:
%timeit (sps.coo_matrix((values[:2500], (rows[:2500], cols[:2500]))) +
sps.coo_matrix((values[2500:5000], (rows[2500:5000], cols[2500:5000]))) +
sps.coo_matrix((values[5000:7500], (rows[5000:7500], cols[5000:7500]))) +
sps.coo_matrix((values[7500:], (rows[7500:], cols[7500:]))))
100 loops, best of 3: 5.76 ms per loop