Search code examples
pythonnumpymatrixscipysparse-matrix

How to convert a large (10^6 * 10^6) Numpy sparse matrix to a Scipy sparse matrix?


I have a very large sparse Numpy matrix (of type numpy.ndarray). The matrix is so large that it probably has to be stored in the virtual memory. How can I efficiently convert it to a sparse Scipy matrix (from scipy.sparse)(to be used for arithmetic operations) ?

The following is a direct conversion with dok_matrix, which fails due to a memory issue presumably. Changing dok_matrix to csr_matrix causes the same memory issue.

In [1]: N=int(1e6)

In [2]: from scipy.sparse import dok_matrix

In [3]: import numpy as np

In [4]: mat=np.zeros((N,N))

In [5]: dok_matrix(mat)
zsh: killed     ipython

My current method is to use a nested loop, which is slow even if I do nothing.

In [9]: for i in range(N):
   ...:     for j in range (N):
   ...:         pass

Any efficient solution to convert a large (10^6 * 10^6) Numpy sparse matrix to a Scipy sparse matrix?


Solution

  • When you do mat = np.zeros((N,N)), Numpy allocates a big matrix full of zeros. To do that, it requests a big zeroized buffer to the operating system (OS). Most OS do not actually perform the allocation in physical memory but in virtual memory. The memory pages are mapped during a first-touch. This means any read or write cause virtual pages to be mapped in physical memory. Please see this post for more information about this. Also please consider reading the famous article What Every Programmer Should Know About Memory for more information about virtual memory. The thing is dok_matrix(mat) needs to read the whole matrix so to create the sparse matrix so it will trigger the mapping of all pages of the matrix resulting in an out of memory. When there is no more space left, the Linux's OOM-Killer kills programs using too much memory, hence a killed ipython message. The same problem happens with any kind of sparse matrices. The main problem is you cannot read the whole created matrix. In fact, creating such matrix is nearly pointless unless you know that only some tiny parts will be read/written (never the full matrix).

    The solution is to create a space matrix directly and fill it like Numpy dense matrices. It is significantly slower, but this is the price to pay for using sparse matrices. DOK matrices generally takes a lot of memory unless only few items are filled. That being said, DOK matrices are one of the fastest for random accesses (because they are implemented using a hash-table internally). CSR are good for matrices that do not change once created (ie. modifying a CSR matrix is very slow) and with only few non-zero items per line. Note that CSR matrices can be created quite quickly from a data 1D array and (row_ind, col_ind) array tuple. See the documentation for more information.

    The sparsity of the matrix, that is, the ratio NumberOfNonZeroValue / NumberOfValues needs to be (very) small for the sparse matrix to be useful. Sparse matrices tends to be slow because their internal representation operates on a kind of compressed matrix.