Search code examples
pythonpandasscipyscikit-learnsparse-matrix

How to convert a dataframe to sparse matrix with mixed column types?


I have a data frame of following format:

df:

key   f1    f2
k1    10    a, b, c
k2    20    b, d
k3    15    NaN

The column f2 has a bag of words as values. I want to convert this data frame into a sparse matrix, as distinct words in f2 run to a few thousands. The end result I am expecting is of following format:

key    f1  f2.a  f2.b  f2.c  f2.d
k1     10   1     1     1     0
k2     20   0     1     0     1
k3     15   0     0     0     0

I could figure out how to independently create a sparse matrix just out of key and f2 field. I am first melting the column f2 so I get following dataframe:

df1:
key  f2
k1   a
k1   b
k1   c
k2   b
k2   d

Then I am encoding f2, and using the LabelEncoder from sklearn.preprocessing package to encode f2. Then I am creating a sparse matrix as follows:

df1['trainrow'] = np.arrange(df1.shape[0])
sparse.csr_matrix((np.ones(df1.shape[0], (df1.trainrow, df1.f2_encoded)))

This creates a sparse matrix by doing a one-hot encoding of field f2. But I am not sure how I can concatenate this with the numerical field f1.


Solution

  • You can use concat with str.get_dummies and add_prefix:

    df = pd.concat([df[['key','f1']], df.f2.str.get_dummies(sep=', ').add_prefix('f2.')], axis=1)
    print (df)
      key  f1  f2.a  f2.b  f2.c  f2.d
    0  k1  10     1     1     1     0
    1  k2  20     0     1     0     1
    2  k3  15     0     0     0     0
    

    In very large distinct values get_dummies is very slow, you can use custom function f:

    def f(category_list):
        n_categories = len(category_list)
        return pd.Series(dict(zip(category_list, [1]*n_categories)))
    
    #remove NaN rows and create list of values by split
    df1 = df.f2.dropna().str.split(', ').apply(f).add_prefix('f2.')
    df2 = pd.concat([df[['key','f1']], df1], axis=1)
    #replace NaN to 0 by position from 3.column to end of df
    df2.iloc[:, 2: ] = df2.iloc[:, 2: ].fillna(0).astype(int)
    print (df2)
      key  f1  f2.a  f2.b  f2.c  f2.d
    0  k1  10     1     1     1     0
    1  k2  20     0     1     0     1
    2  k3  15     0     0     0     0
    

    Timings:

    In [256]: %timeit s.str.get_dummies(sep=', ')
    1 loop, best of 3: 1min 16s per loop
    
    In [257]: %timeit (s.dropna().str.split(', ').apply(f).fillna(0).astype(int))
    1 loop, best of 3: 2.95 s per loop
    

    Code for timings:

    np.random.seed(100)
    s = pd.DataFrame(np.random.randint(10000, size=(1000,1000))).astype(str).apply(', '.join, axis=1)
    print (s)
    
    
    df2 = s.str.get_dummies(sep=', ')
    print (df2)
    
    def f(category_list):
        n_categories = len(category_list)
        return pd.Series(dict(zip(category_list, [1]*n_categories)))
    
    print (s.dropna().str.split(', ').apply(f).fillna(0).astype(int))