python pandas indexing one-hot-encoding

Can I add Multilevel Indexing for one-hot encoded features?


I am working on a dataset of mushroom features. I encoded almost all of them into binary (one-hot) columns with pandas, but a few are encoded as integer codes instead. I am wondering if I can use the original column names as a second level of column index, so that it would look something like this:

Cap Shape:

Bell   Conical  Flat
1      0        0

rather than:

Cap Shape_Bell    Cap Shape_Conical   Cap Shape_Flat

1                 0                   0

This is the code I used to dummy encode them for reference.

mode = df['Stalk Root'].mode()  # most common answer is 'b'

df = df.replace('?', 'b')  # replace all question marks with the most common value

df['Ring Number'] = df['Ring Number'].replace({'n': 0, 'o': 1, 't': 2}).astype(int)
df['Gill Spacing'] = df['Gill Spacing'].replace({'c': 0, 'w': 1, 'd': 2}).astype(int)

df = pd.get_dummies(df)

df.drop(labels=['Poisonous_e', 'Bruises_f', 'Gill Size_n', 'Stalk Shape_t', 'Veil Type_p'], axis=1, inplace=True)
df.rename(columns={'Poisonous_p': 'Poisonous', 'Bruises_t': 'Bruises'}, inplace=True)

I haven't tried much because the resources I found didn't quite make sense to me. I have looked into pd.MultiIndex, but .from_frame, which builds an index from a DataFrame, didn't work for my purposes. I understand that it may also require every category to have the same set of attribute values, but that won't work for me because 'Odor' and 'Cap Color' definitely don't have the same attribute options.


Solution

  • You can split your column names on _ and then use MultiIndex.from_tuples to create a new multi-level index:

    df.columns = pd.MultiIndex.from_tuples(col.split('_') for col in df.columns)
    

    Output:

      Cap Shape
           Bell Conical Flat
    0         1       0    0
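For a fuller picture, here is a minimal sketch of the same idea on a made-up three-row frame. The frame `raw`, its values, and the helper `split_column` are assumptions for illustration, not part of the original data. Unlike the one-liner above, the helper splits on the first underscore only and gives columns that have no underscore (such as an integer-coded 'Ring Number') an empty string as their second level, so every tuple has the same length.

```python
import pandas as pd

# Hypothetical miniature of the mushroom data; the real frame has many more columns.
raw = pd.DataFrame({
    "Cap Shape": ["Bell", "Conical", "Flat"],
    "Odor": ["Almond", "Anise", "Almond"],
    "Ring Number": [1, 2, 1],  # already numeric, so get_dummies leaves it as-is
})
encoded = pd.get_dummies(raw, dtype=int)

def split_column(name):
    """Split 'Feature_Value' into (feature, value); no underscore gives (name, '')."""
    parent, _, value = name.partition("_")
    return parent, value

encoded.columns = pd.MultiIndex.from_tuples(
    [split_column(col) for col in encoded.columns]
)

# Selecting a top-level key returns only that feature's dummy columns.
cap_shape = encoded["Cap Shape"]
print(cap_shape.columns.tolist())  # ['Bell', 'Conical', 'Flat']
```

Each feature keeps its own set of second-level labels, so 'Odor' and 'Cap Shape' do not need to share attribute options. If the original column names themselves may contain underscores, splitting on the first underscore would cut them in the wrong place; in that case passing a different `prefix_sep` to `pd.get_dummies` and splitting on that separator is probably the safer route.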