One-hot vs Grouping for Feature Engineering

Note: I don't have 10 rep yet, so I can't post images.

I'm working with the Adult Census dataset (the goal is to predict which observed people have an annual income greater than $50K/year) for some ML practice, and I have a question about feature engineering.

Of the dataset's columns, 8 are categorical: workclass, education (dropped because the integer education.num exists), marital.status, occupation, relationship, race, sex, and income.


As a first step, I changed income to 1 for >$50K/year and 0 for <=$50K/year.

data['income'] = data['income'].replace({'<=50K': 0, '>50K': 1})

However, when looking at the other variables, I needed some guidance/advice on how to approach them. For example, the 'workclass' column

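import matplotlib.pyplot as plt
import seaborn as sns

# Bar height = mean of the 0/1 income column per workclass, i.e. the share earning >50K
plt.figure(figsize=(15, 5))
sns.barplot(x=data['workclass'], y=data['income'])
plt.xlabel('Working Class')
plt.ylabel('Likelihood of income > 50K')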


My first idea was to use one-hot encoding. However, workclass, race, marital.status, and occupation are all unordered, so one-hot encoding everything would create nearly 100 columns.
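To see where the column count comes from, here is a minimal sketch on a toy frame. The rows are made up for illustration and are not the real dataset. Each categorical column contributes one one-hot column per distinct value, so `nunique()` gives a quick estimate before encoding anything.

```python
import pandas as pd

# Toy stand-in for the Adult Census data; the rows are illustrative only.
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp-inc"],
    "race": ["White", "Black", "White", "Asian-Pac-Islander"],
    "sex": ["Male", "Female", "Female", "Male"],
})

# Distinct values per column = one-hot columns that column would produce.
cardinality = toy.nunique()

# One-hot encode every column; drop_first=True would remove one
# redundant indicator per feature if collinearity is a concern.
encoded = pd.get_dummies(toy, columns=list(toy.columns))

print(cardinality)
print(encoded.shape)
```

Running `data.nunique()` on the real categorical columns would show which ones are responsible for most of the blow-up.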

My next idea was to manually group them based on the probability of a certain column value having an income >$50K, picked based on plots like the one below


Going by this, my inclination for each column would be

Column          Feature Engineering Decision
workclass       Drop
marital.status  Group (married + present = 1, not married/estranged = 0)
occupation      Group (white collar jobs (exec, prof, tech, sales) = 1, blue collar (all else) = 0)
race            Unsure; only 5 values, so could one-hot or group by white vs non-white?
relationship    Group (Husband or Wife = 1, no marital relationship = 0)
sex             One-hot, or Male = 1, Female = 0; unsure, need input
native.country  Tons of values; I think grouping by US vs non-US makes most sense
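A rough sketch of what the manual grouping could look like in pandas. The category labels, the `native.country` column name, and the exact membership of each group are assumptions made for illustration; check them against `data[col].unique()` before relying on them.

```python
import pandas as pd

# Toy rows; the labels are assumed to match the dataset's spelling.
toy = pd.DataFrame({
    "marital.status": ["Married-civ-spouse", "Never-married", "Divorced", "Married-AF-spouse"],
    "occupation": ["Exec-managerial", "Craft-repair", "Sales", "Other-service"],
    "native.country": ["United-States", "Mexico", "United-States", "India"],
})

# Hypothetical group definitions; adjust after inspecting the real values.
MARRIED_PRESENT = {"Married-civ-spouse", "Married-AF-spouse"}
WHITE_COLLAR = {"Exec-managerial", "Prof-specialty", "Tech-support", "Sales"}

# Each grouped feature is a 0/1 indicator for membership in the chosen set.
grouped = pd.DataFrame({
    "married": toy["marital.status"].isin(MARRIED_PRESENT).astype(int),
    "white_collar": toy["occupation"].isin(WHITE_COLLAR).astype(int),
    "from_us": (toy["native.country"] == "United-States").astype(int),
})

print(grouped)
```

Keeping the group definitions in named sets makes it easy to revisit them later without touching the encoding code.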

Here is a link to the full Jupyter notebook with graphs for all categorical variables. Can you help me decide whether this is the right way to feature engineer the columns in my dataset?


  • Although Adult Census is a classic toy dataset, I can't recall all the details so my answer may not be as informative as you expect. Still, I love everything about your decisions except maybe this idea:

    My next idea was to manually group them based on the probability of a certain column value having an income >$50K, picked based on plots like the one below

    Since your goal is to build a model that predicts income status for unobserved individuals, tying your categories directly to the target will likely lead to overfitting: in test data, the distribution of categories can differ from that in the training data. That doesn't mean you shouldn't consider the target classes at all, though. Try calculating the probabilities on subsamples of your category columns, or adding noise to the obtained probabilities. See the category_encoders package for inspiration. Good luck!

    P.S. Are workclass and education.num really identical?🤔
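One way to read the "probabilities on subsamples" and "adding noise" suggestions in the answer is out-of-fold target encoding. The sketch below is hand-rolled with pandas and NumPy and is not the category_encoders implementation. The function name, the smoothing scheme, and all parameter defaults are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def oof_target_encode(cat, target, n_splits=5, smoothing=10.0, noise_std=0.0, seed=0):
    """Encode each row's category with the target rate computed on the other folds."""
    rng = np.random.default_rng(seed)
    cat = cat.reset_index(drop=True)
    target = target.reset_index(drop=True)
    # Randomly assign every row to one of n_splits folds.
    fold = rng.permutation(len(cat)) % n_splits
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    for k in range(n_splits):
        held_out = fold == k
        # Statistics come only from rows outside the held-out fold,
        # so a row never sees its own label.
        prior = target[~held_out].mean()
        stats = target[~held_out].groupby(cat[~held_out]).agg(["sum", "count"])
        # Shrink rare categories toward the overall positive rate.
        smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories unseen in the other folds fall back to the prior.
        encoded[held_out] = cat[held_out].map(smoothed).fillna(prior).to_numpy()
    if noise_std > 0:
        # Optional Gaussian noise as extra regularisation.
        encoded = encoded + rng.normal(0.0, noise_std, size=len(encoded))
    return encoded


# Toy data, illustrative only.
toy = pd.DataFrame({
    "workclass": ["Private"] * 6 + ["State-gov"] * 4,
    "income": [1, 0, 0, 1, 0, 0, 1, 1, 0, 1],
})
toy["workclass_te"] = oof_target_encode(toy["workclass"], toy["income"], n_splits=5)
print(toy)
```

For test data, the usual approach is to compute the smoothed rates once on the full training set and map them onto the test categories, so the test labels are never used.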