python numpy scipy sparse-matrix

How to efficiently remove columns from a sparse matrix that only contain zeros?


What is the best way to efficiently remove columns that contain only zeros from a sparse matrix? I have created a matrix and filled it with data:

matrix = sp.sparse.lil_matrix((100, 100))

I now wish to remove roughly the last 20 columns, which contain only zeros. How can I do this?


Solution

  • If this were just a NumPy array X, you could write X != 0, which gives a boolean array of the same shape as X, and then index X with that boolean array, i.e. non_zero_entries = X[X != 0]

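    But this is a sparse matrix. In the SciPy version this answer was written against, sparse matrices did not support boolean indexing, and X != 0 did not give an element-wise result: it returned a single boolean value that seemed to depend only on whether the two operands were the exact same matrix in memory. (Newer SciPy releases return a sparse boolean matrix for X != 0.)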

    What you want is NumPy's nonzero function, np.nonzero, which for a sparse matrix calls the matrix's own nonzero() method.

    import numpy as np
    from scipy import sparse
    
    X = sparse.lil_matrix((100,100)) # some sparse matrix
    X[1,17] = 1
    X[17,17] = 1
    indices = np.nonzero(X) # a tuple of two arrays: 0th is row indices, 1st is cols
    X.tocsc()[indices] # this just gives you the array of all non-zero entries
    

    If you want only the full columns that contain non-zero entries, take the second array from indices (index 1). You also need to account for repeated indices, which occur when a column has more than one non-zero entry:

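    columns_non_unique = indices[1]                  # column index of every non-zero entry
    unique_columns = np.unique(columns_non_unique)   # sorted, with duplicates removed
    X_reduced = X.tocsc()[:, unique_columns]         # keep only columns with a non-zero entry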
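
As an alternative to collecting indices with np.nonzero, the per-column count of stored entries can be read directly from a CSC copy of the matrix. The following is a minimal sketch rather than a definitive implementation: the helper name drop_empty_columns and the extra entry at X[3, 42] are made up for illustration. It assumes that getnnz(axis=0) counts stored entries, so eliminate_zeros() is called first to discard any explicitly stored zeros.

```python
import numpy as np
from scipy import sparse


def drop_empty_columns(matrix):
    """Return (reduced, kept): a CSC matrix without all-zero columns,
    and the original indices of the columns that were kept.

    Hypothetical helper, not part of SciPy.
    """
    # Work on a CSC copy so the caller's matrix is left untouched.
    csc = sparse.csc_matrix(matrix, copy=True)
    # Drop explicitly stored zeros so they are not counted below.
    csc.eliminate_zeros()
    # Number of stored (now strictly non-zero) entries in each column.
    nnz_per_column = csc.getnnz(axis=0)
    # Indices of the columns that hold at least one non-zero entry.
    kept = np.flatnonzero(nnz_per_column)
    return csc[:, kept], kept


X = sparse.lil_matrix((100, 100))
X[1, 17] = 1
X[17, 17] = 1
X[3, 42] = 5  # illustrative extra entry

reduced, kept = drop_empty_columns(X)
print(reduced.shape)  # (100, 2)
print(kept)           # [17 42]
```

Returning the kept indices alongside the reduced matrix makes it possible to map columns of the result back to the original matrix. If you need LIL format again afterwards, reduced.tolil() converts the result back.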