python, sparse-matrix, feature-extraction

How to replace NaN in a sparse csr_matrix in Python


I have hstacked a sparse matrix and a DataFrame. The resulting csr_matrix contains NaN values.

My question is how to replace these NaN values with 0.

X_train_1hc = sp.sparse.hstack([X_train_1hc, X_train_df.values]).tocsr()

When I pass X_train_1hc to a classifier, I get the error: Input contains NaN or infinity or a value too large for dtype('float')

Is there an option, function, or hack to replace NaN values in a sparse matrix? This is a conceptual question, so no data is provided.


Solution

  • Expanding a bit on Martin's answer, here is one way to do it. Assume you have a csr_matrix with some NaN values:

    >>> Asp.todense()
    matrix([[0.37512508,        nan, 0.34919696, 0.10321203],
            [0.48744859, 0.07289436, 0.16881342, 0.57637166],
            [0.37742037, 0.01425494, 0.38536847, 0.23799655],
            [0.95520474, 0.97719059,        nan, 0.22877082]])
    

    Since the csr_matrix stores its nonzero entries in the data attribute, that is the array you need to modify. To replace all occurrences of NaN with 0, and of inf with a very large number (in fact the largest finite value representable, or the most negative one for -inf), you can do

    >>> Asp.data = np.nan_to_num(Asp.data, copy=False)
    >>> Asp.todense()
    matrix([[0.37512508, 0.        , 0.34919696, 0.10321203],
            [0.48744859, 0.07289436, 0.16881342, 0.57637166],
            [0.37742037, 0.01425494, 0.38536847, 0.23799655],
            [0.95520474, 0.97719059, 0.        , 0.22877082]])
    

    Alternatively, you can replace just the NaNs manually like this:

    >>> Asp.data[np.isnan(Asp.data)] = 0.0
    >>> Asp.todense()
    matrix([[0.37512508, 0.        , 0.34919696, 0.10321203],
            [0.48744859, 0.07289436, 0.16881342, 0.57637166],
            [0.37742037, 0.01425494, 0.38536847, 0.23799655],
            [0.95520474, 0.97719059, 0.        , 0.22877082]])
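
To tie this back to the hstack case in the question, here is a minimal, self-contained sketch. The helper name `replace_nan_in_sparse` and the small `one_hot` and `dense_features` arrays are made up for illustration; they stand in for `X_train_1hc` and `X_train_df.values`. It assumes a floating-point matrix and only handles NaN, not inf.

```python
import numpy as np
import scipy.sparse as sp


def replace_nan_in_sparse(mat, fill_value=0.0):
    """Return a CSR copy of `mat` with stored NaN entries set to `fill_value`.

    Hypothetical helper for illustration; it does not touch inf values.
    """
    # Copy so the caller's matrix is left unchanged.
    out = sp.csr_matrix(mat, dtype=np.float64, copy=True)
    # Only explicitly stored entries live in .data, so NaNs can only be there.
    out.data[np.isnan(out.data)] = fill_value
    if fill_value == 0:
        # The replaced entries are now explicitly stored zeros; drop them
        # so the matrix stays as sparse as possible.
        out.eliminate_zeros()
    return out


# Made-up stand-ins for the question's X_train_1hc and X_train_df.values.
one_hot = sp.csr_matrix(np.array([[1.0, 0.0], [0.0, 1.0]]))
dense_features = np.array([[0.5, np.nan], [np.nan, 2.0]])

stacked = sp.hstack([one_hot, dense_features]).tocsr()
cleaned = replace_nan_in_sparse(stacked)
```

Replacing a stored NaN with 0 leaves an explicitly stored zero behind, which is why the sketch calls `eliminate_zeros()` afterwards. This step is optional and does not change the values a classifier sees.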