Search code examples
pandaspivotpivot-tablesparse-matrixone-hot-encoding

Pivot dataframe with multiple categorical variables into sparse matrix


I am trying to pivot a dataframe with categorical features directly into a sparse matrix. My question is similar to this question, or this one, but my dataframe contains multiple categorical variables, so those approaches don't work.

This code currently works, but df.pivot() works with a dense matrix and with my real dataset, I run out of RAM. Can you suggest a way to work with sparse matrices from the start?

import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.preprocessing import OneHotEncoder

# Example dataframe
data = {
  'id':[13,13,14,14,14,15],
  'name':['alex', 'mary', 'alex', 'barry', 'john', 'john'],
  'categ': ['dog', 'cat', 'dog', 'ant', 'fox', 'seal'],
  'size': ['big', 'small', 'big', 'tiny', 'medium', 'big']

}
df = pd.DataFrame(data)

# Pivoting -- problematic step creating a huge dense matrix
pivot = df.pivot(index='id', columns='name', values=['categ','size']).fillna('-')
cols = pivot.columns.to_flat_index().str.join('_')
pivot.columns = cols

# One hot encoding into a sparse matrix
encoder = OneHotEncoder(sparse_output=True)
pivot_enc = encoder.fit_transform(pivot)
pivot = pd.DataFrame.sparse.from_spmatrix(
    pivot_enc, columns=encoder.get_feature_names_out())

print(pivot.dtypes)

Thanks for help!


Solution

  • The trick is in using a groupby(), which can work with Sparse types directly, but only if categorical variables are first encoded using .cat.codes.

    Here is the answer which uses sparse matrices throughout:

    for col in ['categ', 'size']:
        df[col] = df[col].astype('category').cat.codes
    
    # Group by into sparse columns
    piv = df.groupby(['id', 'name'])[['categ', 'size']].first().astype('Sparse')
    
    # Unstack keeps sparse format
    piv = piv.unstack(fill_value=0)
    
    piv.columns = piv.columns.to_flat_index().str.join('_')
    
    # Encoding works as previously
    encoder = OneHotEncoder(sparse_output=True)
    piv_enc = encoder.fit_transform(piv)
    piv_fin = pd.DataFrame.sparse.from_spmatrix(
        piv_enc, columns=encoder.get_feature_names_out())
    
    

    The way to keep the original category labels can be found in this post.