Search code examples
pythonpandasone-hot-encoding

OneHotEncoder -- keep feature names after encoding categorical variables


Question

After encoding categorical columns as numbers and pivoting LONG to WIDE into a sparse matrix, I am trying to retrieve the category labels for column names. I need this information to interpret the model in a latter step.

Solution

Below is my solution, which is really convoluted, please let me know if you have a better way:

import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.preprocessing import OneHotEncoder

# Example dataframe
data = {
  'id':[13,13,14,14,14,15],
  'name':['alex', 'mary', 'alex', 'barry', 'john', 'john'],
  'categ': ['dog', 'cat', 'dog', 'ant', 'fox', 'seal'],
  'size': ['big', 'small', 'big', 'tiny', 'medium', 'big']

}
df = pd.DataFrame(data)

# Create dictionaries from original dataframe to save categories
# Part of the convoluted solution
dcts = []
df_cols = ['categ', 'size']

for col in df_cols:
    cats = df[col].astype('category')
    dct = dict(enumerate(cats.cat.categories))
    dcts.append(dct)

# Change into category codes, otherwise sparse matrix cannot be built
for col in ['categ', 'size']:
    df[col] = df[col].astype('category').cat.codes

# Group by into sparse columns
piv = df.groupby(['id', 'name'])[['categ', 'size']].first().astype('Sparse[int]')

# Unstack keeps sparse format
piv = piv.unstack(fill_value=0)

piv.columns = piv.columns.to_flat_index().str.join('_')

# Encoding gives poor column names
encoder = OneHotEncoder(sparse_output=True)
piv_enc = encoder.fit_transform(piv)
piv_fin = pd.DataFrame.sparse.from_spmatrix(
    piv_enc, columns=encoder.get_feature_names_out())

The column names look like this: 'categ_alex_-', 'categ_alex_2.0', 'categ_barry_-', 'categ_barry_0.0', while we need the original category labels, i.e. 'categ_alex_-', 'categ_alex_dog', 'categ_barry_-', 'categ_barry_ant'.

Convoluted part I need advice on

# Fixing column names
piv_cols = list(piv_fin.columns)
for (dct, df_col) in zip(dcts, df_cols):
    print(df_col, dct)
    for i, piv_col in enumerate(piv_cols):
        if df_col in piv_col:
            if piv_col[-1:] != '-':
                piv_cols[i] = piv_col[:-2] + '_' + dct[int(piv_col[-1:])]

piv_fin.columns = piv_cols
 

I'm sure there's a better way, perhaps OneHotEncoder can use category labels directly? Thanks for help!


Solution

  • You can make things a bit easier by using a dictionary instead of a list to save categories:

    # Create dictionaries from original dataframe to save categories
    dcts = {}  # instead of []
    df_cols = ["categ", "size"]
    
    for col in df_cols:
        cats = df[col].astype("category")
        dct = dict(enumerate(cats.cat.categories))
        dcts[col] = dct  # instead of dcts.append(dct)
    

    Then, using str.replace from Python standard library:

    # Fixing column names
    piv_cols = [
        col.replace(col[-1], dcts[col.split("_")[0]][int(col[-1])])
        if str.isnumeric(col[-1])
        else col
        for col in piv_fin.columns
    ]
    

    So that:

    print(piv_cols)
    # Output
    
    ['categ_alex_-',
     'categ_alex_dog',
     'categ_barry_-',
     'categ_barry_ant',
     'categ_john_-',
     'categ_john_fox',
     'categ_john_seal',
     'categ_mary_-',
     'categ_mary_cat',
     'size_alex_-',
     'size_alex_big',
     'size_barry_-',
     'size_barry_tiny',
     'size_john_-',
     'size_john_big',
     'size_john_medium',
     'size_mary_-',
     'size_mary_small']