Tags: pandas, scipy, sparse-matrix, sklearn-pandas

Convert a single pandas column to a SciPy sparse matrix


I have a pandas data frame like this:

     a                           other-columns
   0.3 0.2 0.0 0.0 0.0...        ....

I want to convert column a into a SciPy sparse CSR matrix. Each value of a is a probability distribution stored as a space-separated string. I would like to do the conversion without expanding a into multiple columns.

This is the naive solution, which expands a into multiple columns:

  df = df.join(df['a'].str.split(expand = True).add_prefix('a')).drop(['a'], axis = 1)
  df_matrix = scipy.sparse.csr_matrix(df.values)

However, I don't want to expand a into multiple columns, because that makes memory usage shoot up. Is it possible to do this while keeping a as a single column?

EDIT (Minimal Reproducible Example):

 import pandas as pd
 from scipy.sparse import csr_matrix
 d = {'a': ['0.05 0.0', '0.2 0.0']}
 df = pd.DataFrame(data=d)
 df = df.join(df['a'].str.split(expand=True).add_prefix('a')).drop(['a'], axis=1)
 df = df.astype(float)
 df_matrix = csr_matrix(df.values)
 df_matrix

Output:

 <2x2 sparse matrix of type '<class 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Row format>

I want to achieve the above, but without splitting a into multiple columns. Also, in my real file, the column holds space-separated strings of length 36, and there are millions of rows. Every row is guaranteed to contain 36 spaces.


Solution

  • Also, in my real file, I have 36 length string (separated by space) columns and millions of rows. It is sure that all rows will contain 36 spaces.

    Related question: Convert large csv to sparse matrix for use in sklearn

    I cannot overstate how much you should not do the thing that follows this sentence.

    import pandas as pd
    from scipy import sparse

    df = pd.DataFrame({'a': ['0.05 0.0', '0.2 0.0'] * 100000})
    chunksize = 10000

    # Parse one chunk of rows at a time, so only that chunk is ever held as dense floats.
    chunks = []
    for start in range(0, len(df), chunksize):
        chunk = df['a'].iloc[start:start + chunksize]
        rows = [[float(y) for y in x.split()] for x in chunk]
        chunks.append(sparse.coo_matrix(rows))

    # Stack the per-chunk COO matrices vertically; call .tocsr() on the result if CSR is needed.
    sparse_coo = sparse.vstack(chunks)
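
The same chunked idea can be wrapped in a small helper that returns CSR directly, which is what the question asks for. This is only a sketch, not a tuned solution: the function name `column_to_csr` and its `chunksize` default are made up for illustration, and it assumes every row holds the same number of space-separated values. It joins each chunk into one string so NumPy parses all of the chunk's numbers in a single call instead of one Python `float()` call per value.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def column_to_csr(series, chunksize=10000):
    """Parse a column of space-separated floats into a CSR matrix, chunk by chunk.

    Hypothetical helper; assumes a non-empty column in which every row
    holds the same number of values.
    """
    blocks = []
    for start in range(0, len(series), chunksize):
        chunk = series.iloc[start:start + chunksize]
        # Join the chunk into one string and convert all of its tokens at once.
        flat = np.array(' '.join(chunk).split(), dtype=float)
        # One row per original string; the column count is inferred.
        dense = flat.reshape(len(chunk), -1)
        # Only this chunk is dense at any time; zeros are dropped here.
        blocks.append(sparse.csr_matrix(dense))
    return sparse.vstack(blocks, format='csr')


df = pd.DataFrame({'a': ['0.05 0.0', '0.2 0.0']})
m = column_to_csr(df['a'])
print(m.shape, m.nnz)
```

Peak memory is governed by `chunksize`: a smaller value keeps the temporary dense block small at the cost of more loop iterations, so it is worth trying a few values on the real data.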