python numpy scipy sparse-matrix

How to efficiently store a variable number of scipy.sparse csr_matrix objects in memory?


I have around 10,000 sparse matrices, each of size 50,000x5 with an average density of 0.0004. In each of the 10,000 loop iterations, I compute a numpy array, convert it to a csr_matrix, and append that to a list. But memory consumption is as high as if I were appending the dense numpy arrays, not what I would expect from appending csr_matrices.

How can I reduce memory consumption while keeping these 10K sparse matrices in memory for further computation?

Sample code:

from scipy.sparse import csr_matrix
import numpy as np
sparse_matrices = []

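for i in range(10000):
    # get_np_array() is my own function; it returns a dense array of roughly 50,000x5
    np_array = get_np_array()
    sparse_matrix = csr_matrix(np_array)
    sparse_matrices.append(sparse_matrix)
    # dense size in bytes, size of the stored values only, and a summary of the matrix
    print np_array.nbytes, sparse_matrix.data.nbytes, repr(sparse_matrix)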

This prints something like the following, which makes it clear that I'm appending compressed matrices. Still, memory grows just as it does when appending the dense numpy arrays.

1987520 520 <49688x5 sparse matrix of type '<type 'numpy.float64'>'
    with 65 stored elements in Compressed Sparse Row format>
1987520 512 <49688x5 sparse matrix of type '<type 'numpy.float64'>'
    with 64 stored elements in Compressed Sparse Row format>

Just realised that if I use coo_matrix instead of csr_matrix, memory consumption is reasonable. With csr_matrix, memory use is around 8 GB.


Solution

  • For the matrix:

    <49688x5 sparse matrix of type '<type 'numpy.float64'>'
    with 65 stored elements in Compressed Sparse Row format>
    

    In coo format, the key attributes are row, col and data, each with 65 elements. data holds floats; row and col hold integers (the row and column indices).

    In csr format the row attribute is replaced with indptr, which has one value per row plus 1, while indices holds the 65 column indices. With this shape indptr is 49689 elements long. If it were csc format, indptr would have only 6 elements (one per column plus 1).

    csr is usually more compact than coo. But in your case there are many empty rows, so it is much larger. Note that sparse_matrix.data.nbytes counts only the stored values, not indptr, which is why the printed sizes look small. csr is especially compact for a single-row matrix, and not compact at all for a column vector.
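    To see the difference yourself, here is a minimal sketch that sums the bytes held by every underlying array, not just data. The helper sparse_nbytes and the randomly generated matrix are my own stand-ins (not your real data), and the exact byte counts depend on the index dtype scipy picks, so I only compare the formats against each other.

```python
import numpy as np
from scipy.sparse import coo_matrix


def sparse_nbytes(m):
    """Bytes held by the index and data arrays of a coo, csr or csc matrix.

    Hypothetical helper for illustration; it ignores Python object overhead.
    """
    if m.format == "coo":
        arrays = (m.row, m.col, m.data)
    else:  # csr and csc both use indptr / indices / data
        arrays = (m.indptr, m.indices, m.data)
    return sum(a.nbytes for a in arrays)


# Stand-in for one matrix from the question: 49688x5 with 65 stored values.
rng = np.random.default_rng(0)
n_rows, n_cols, nnz = 49688, 5, 65

# Pick 65 distinct cells so no duplicates get merged during conversion.
flat = rng.choice(n_rows * n_cols, size=nnz, replace=False)
rows, cols = np.unravel_index(flat, (n_rows, n_cols))
values = rng.random(nnz) + 1.0  # strictly nonzero values

coo = coo_matrix((values, (rows, cols)), shape=(n_rows, n_cols))
csr = coo.tocsr()
csc = coo.tocsc()

# Print the format, the indptr length (0 for coo, which has none),
# and the total size of the underlying arrays.
for m in (coo, csr, csc):
    indptr_len = len(m.indptr) if m.format != "coo" else 0
    print(m.format, indptr_len, sparse_nbytes(m))
```

    For a tall, narrow shape like this one, csr should come out far larger than coo or csc, because its indptr alone has one entry per row. If the later computations allow it, keeping the matrices as coo or csc (or storing the transpose as csr) avoids that overhead.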