I am working on a dataset of mushroom features. I encoded almost all of them into binary columns with pandas, but a few are mapped to integers instead. I am wondering if I can use the original column names as a second level of the column index, so that it would look something like this:
Cap Shape:
Bell Conical Flat
1 0 0
rather than:
Cap Shape_Bell Cap Shape_Conical Cap Shape_Flat
1 0 0
This is the code I used to dummy encode them for reference.
mode = df['Stalk Root'].mode() #most common answer is b
df = df.replace('?', 'b') #replace all question marks with most common value
df['Ring Number'] = df['Ring Number'].replace({'n': 0, 'o': 1, 't': 2}).astype(int)
df['Gill Spacing'] = df['Gill Spacing'].replace({'c': 0, 'w': 1, 'd': 2}).astype(int)
df = pd.get_dummies(df)
df.drop(labels = ['Poisonous_e', 'Bruises_f', 'Gill Size_n', 'Stalk Shape_t', 'Veil Type_p'], axis = 1, inplace = True)
df.rename(columns={'Poisonous_p': 'Poisonous', 'Bruises_t': 'Bruises'}, inplace = True)
I haven't tried much because the resources I found didn't quite make sense to me. I have looked into pd.MultiIndex, but .from_frame, which builds an index from a DataFrame, didn't work for my purposes. I understand that it may also require the same attribute values for each category, but that won't work for me because 'Odor' and 'Cap Color' definitely don't have the same attribute options.
You can split your column names on _ and then use MultiIndex.from_tuples to create a new multi-level index:
df.columns = pd.MultiIndex.from_tuples(col.split('_') for col in df.columns)
Output:
  Cap Shape
       Bell Conical Flat
0         1       0    0
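One thing to watch for: columns that contain no underscore (in the question, 'Poisonous' and 'Bruises' after the rename, plus 'Ring Number' and 'Gill Spacing') split into a single part rather than two. Here is a minimal sketch of one way to handle that, giving those columns an empty string as their second level. The toy frame and its values are made up for illustration and are not the real mushroom data.

```python
import pandas as pd

# Toy data standing in for the mushroom dataset (hypothetical values).
raw = pd.DataFrame({
    "Cap Shape": ["Bell", "Conical", "Flat"],
    "Odor": ["Almond", "None", "Almond"],
    "Ring Number": [1, 2, 1],  # already numeric, so get_dummies leaves it alone
})

encoded = pd.get_dummies(raw, dtype=int)

# Split each name at the first underscore. Names without an underscore
# get an empty string as their second level.
pairs = []
for col in encoded.columns:
    top, _, sub = col.partition("_")
    pairs.append((top, sub))

encoded.columns = pd.MultiIndex.from_tuples(pairs)

# Selecting a top-level name returns just that feature's dummy columns.
cap_shape = encoded["Cap Shape"]
print(cap_shape)
```

Each top-level name keeps its own set of sub-columns, so 'Odor' and 'Cap Color' do not need to share the same attribute options.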