Scipy Handling of Large COO matrix


I have a large sparse matrix in the form of a scipy coo_matrix (size of 5GB). I have to make use of the non-zero entries of the matrix and do some further processing.

What would be the best way to access the elements of the matrix? Should I convert the matrix to other formats or use it as it is? Also, could you please tell me the exact syntax for accessing an element of a coo_matrix? I got a bit confused since it doesn't allow slicing.


Solution

  • First let's build a random COO matrix:

    import numpy as np
    from scipy import sparse
    
    x = sparse.rand(10000, 10000, format='coo')
    

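    The non-zero values are stored in the .data attribute of the matrix, and you can get their corresponding row/column indices using x.nonzero() (for a COO matrix, the indices of all stored entries are also available directly as the .row and .col attributes):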

    v = x.data
    r, c = x.nonzero()
    
    print(np.all(x.todense()[r, c] == v))
    # True
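
    If the further processing only needs the stored entries, one option is to work on the (row, column, value) triples directly, without converting the matrix. The following is a minimal sketch on a tiny hand-built matrix; the values and the per-row sum are illustrative assumptions, not part of the original question:

```python
import numpy as np
from scipy import sparse

# Tiny hand-built COO matrix standing in for the large one (illustrative values).
row = np.array([0, 0, 2, 3])
col = np.array([1, 3, 2, 0])
data = np.array([4.0, 5.0, 7.0, 9.0])
x = sparse.coo_matrix((data, (row, col)), shape=(4, 4))

# Stored entry k is the triple (x.row[k], x.col[k], x.data[k]).
entries = list(zip(x.row.tolist(), x.col.tolist(), x.data.tolist()))

# Vectorized processing over the stored entries, e.g. per-row sums,
# avoids a Python-level loop over every non-zero.
row_sums = np.bincount(x.row, weights=x.data, minlength=x.shape[0])
```

    For a matrix with many millions of entries, vectorized operations on x.row, x.col and x.data will generally be much faster than looping over the triples in Python.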
    

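    With a COO matrix it's possible to extract a single row or column (as a sparse vector) using the getrow()/getcol() methods. If you want to do slicing or fancy indexing of particular elements then you need to convert it to another format, for example CSR using the .tocsr() method or LIL using the .tolil() method. For a matrix of this size, CSR (or CSC) is usually the more memory-efficient choice; LIL is mainly intended for building a matrix incrementally.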
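
    As a rough illustration of the indexing syntax after conversion, here is a small sketch using CSR. The matrix contents are made up for the example:

```python
import numpy as np
from scipy import sparse

# Tiny hand-built COO matrix (illustrative values).
row = np.array([0, 0, 2, 3])
col = np.array([1, 3, 2, 0])
data = np.array([4.0, 5.0, 7.0, 9.0])
x = sparse.coo_matrix((data, (row, col)), shape=(4, 4))

# COO itself cannot be sliced; convert to CSR first.
x_csr = x.tocsr()

single = x_csr[0, 3]        # a single element, returned as a scalar
block = x_csr[0:2, 1:4]     # slicing returns another sparse matrix
picked = x_csr[[0, 3], :]   # fancy indexing selects rows 0 and 3
```

    Row slicing is the cheap direction for CSR; if most of the access is by column, .tocsc() is likely the better fit.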

    The scipy.sparse docs describe the features of the different sparse matrix formats in more detail. The appropriate choice of format depends on what you plan to do with your matrix.