After encoding categorical columns as numbers and pivoting LONG to WIDE into a sparse matrix, I am trying to retrieve the category labels for column names. I need this information to interpret the model in a latter step.
Below is my solution, which is really convoluted, please let me know if you have a better way:
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.preprocessing import OneHotEncoder
# Example dataframe
data = {
'id':[13,13,14,14,14,15],
'name':['alex', 'mary', 'alex', 'barry', 'john', 'john'],
'categ': ['dog', 'cat', 'dog', 'ant', 'fox', 'seal'],
'size': ['big', 'small', 'big', 'tiny', 'medium', 'big']
}
df = pd.DataFrame(data)
# Create dictionaries from original dataframe to save categories
# Part of the convoluted solution
dcts = []
df_cols = ['categ', 'size']
for col in df_cols:
cats = df[col].astype('category')
dct = dict(enumerate(cats.cat.categories))
dcts.append(dct)
# Change into category codes, otherwise sparse matrix cannot be built
for col in ['categ', 'size']:
df[col] = df[col].astype('category').cat.codes
# Group by into sparse columns
piv = df.groupby(['id', 'name'])[['categ', 'size']].first().astype('Sparse[int]')
# Unstack keeps sparse format
piv = piv.unstack(fill_value=0)
piv.columns = piv.columns.to_flat_index().str.join('_')
# Encoding gives poor column names
encoder = OneHotEncoder(sparse_output=True)
piv_enc = encoder.fit_transform(piv)
piv_fin = pd.DataFrame.sparse.from_spmatrix(
piv_enc, columns=encoder.get_feature_names_out())
The column names look like this: 'categ_alex_-', 'categ_alex_2.0', 'categ_barry_-', 'categ_barry_0.0'
, while we need the original category labels, i.e. 'categ_alex_-', 'categ_alex_dog', 'categ_barry_-', 'categ_barry_ant'
.
# Fixing column names
piv_cols = list(piv_fin.columns)
for (dct, df_col) in zip(dcts, df_cols):
print(df_col, dct)
for i, piv_col in enumerate(piv_cols):
if df_col in piv_col:
if piv_col[-1:] != '-':
piv_cols[i] = piv_col[:-2] + '_' + dct[int(piv_col[-1:])]
piv_fin.columns = piv_cols
I'm sure there's a better way, perhaps OneHotEncoder can use category labels directly? Thanks for help!
You can make things a bit easier by using a dictionary instead of a list to save categories:
# Create dictionaries from original dataframe to save categories
dcts = {} # instead of []
df_cols = ["categ", "size"]
for col in df_cols:
cats = df[col].astype("category")
dct = dict(enumerate(cats.cat.categories))
dcts[col] = dct # instead of dcts.append(dct)
Then, using str.replace from Python standard library:
# Fixing column names
piv_cols = [
col.replace(col[-1], dcts[col.split("_")[0]][int(col[-1])])
if str.isnumeric(col[-1])
else col
for col in piv_fin.columns
]
So that:
print(piv_cols)
# Output
['categ_alex_-',
'categ_alex_dog',
'categ_barry_-',
'categ_barry_ant',
'categ_john_-',
'categ_john_fox',
'categ_john_seal',
'categ_mary_-',
'categ_mary_cat',
'size_alex_-',
'size_alex_big',
'size_barry_-',
'size_barry_tiny',
'size_john_-',
'size_john_big',
'size_john_medium',
'size_mary_-',
'size_mary_small']