I'm working on an ML project and am doing some preliminary feature selection (when I later train my actual machine learning model, I intend to use OneHotEncoding).
To do the feature selection I need to convert my categorical variables into numeric codes, like female:0, male:1, other:2. I can't do it manually because I have too many features and values. I'm trying to use cat.codes, but I can't get it to tell me which category each code corresponds to. E.g. does 0 correspond to male, female, or other?
I've tried 2 methods, but neither seems to work:
#Example data
import pandas as pd
data = [[14, "Male", "employed"], [89, "Female", "student"], [48, "Other", "employed"]]
df = pd.DataFrame(data, columns=['Age', 'Gender', 'Occupation'])
#Convert categorical feats to numeric values
categorical_feat = ["Gender", "Occupation"]
for col in categorical_feat:
    df[col] = df[col].astype("category").cat.codes
#Trying to find out what the numeric values correspond to:
df["Gender"].cat.categories[0]  # AttributeError: Can only use .cat accessor with a 'category' dtype
df["Gender"].astype("category").cat.categories[0]  # Output is 0, which isn't what I want. I'm expecting "male", "female", or "other"
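Once a column has been overwritten with its codes, the original labels are gone, so converting it back to a category afterwards only gives you the codes themselves as categories. Save the mapping before you replace the column. Here is one way, which you can probably adapt to suit: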
cat_list = []
categorical_feat = ["Gender", "Occupation"]
for col in categorical_feat:
    df[col] = df[col].astype("category")
    cat_list.append(dict(enumerate(df[col].cat.categories)))
    df[col] = df[col].cat.codes
for idx, name in enumerate(categorical_feat):
    print(name)
    print(cat_list[idx])
print(df)
gives:
Gender
{0: 'Female', 1: 'Male', 2: 'Other'}
Occupation
{0: 'employed', 1: 'student'}
   Age  Gender  Occupation
0   14       1           0
1   89       0           1
2   48       2           0
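If you have many columns, it may be easier to look mappings up by column name than by list position. Below is a minimal sketch of that variation on the same idea, using the example data from the question. The names `code_maps` and `decoded` are made up for illustration; everything else is the same `astype("category")`, `.cat.categories`, and `.cat.codes` used above.

```python
import pandas as pd

# Example data from the question
data = [[14, "Male", "employed"], [89, "Female", "student"], [48, "Other", "employed"]]
df = pd.DataFrame(data, columns=["Age", "Gender", "Occupation"])

categorical_feat = ["Gender", "Occupation"]
code_maps = {}  # hypothetical name: column name -> {code: label}

for col in categorical_feat:
    as_cat = df[col].astype("category")
    # Record the code -> label mapping BEFORE overwriting the column
    code_maps[col] = dict(enumerate(as_cat.cat.categories))
    df[col] = as_cat.cat.codes

print(code_maps["Gender"])  # {0: 'Female', 1: 'Male', 2: 'Other'}

# Translate the codes back to labels when needed
decoded = df["Gender"].map(code_maps["Gender"])
print(decoded.tolist())  # ['Male', 'Female', 'Other']
```

One thing to watch for: missing values get the code -1 from cat.codes, and -1 will not appear in the mapping, so it is worth checking for it if your real data has NaNs.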