python machine-learning scikit-learn one-hot-encoding feature-engineering

One-hot vs Grouping for Feature Engineering

Note: I don't have 10 rep yet, so I can't post images.

I'm working with the Adult Census dataset for some ML practice (the goal is to predict which people have an annual income greater than $50K) and have a question about feature engineering.

The dataset's categorical columns are workclass, education (dropped because the integer education.num exists), marital.status, occupation, relationship, race, sex, native.country, and income. That leaves 8 categorical columns once education is dropped.


As a first step, I changed income to 1 for >50K/year and 0 for <=50K/year.

data['income'] = data['income'].replace({'<=50K': 0, '>50K': 1})

However, when looking at the other variables, I needed some guidance on how to approach them. Take the 'workclass' column, for example:

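import matplotlib.pyplot as plt
import seaborn as sns

# Bar height is the mean of the 0/1 income flag, i.e. the share earning >50K
plt.figure(figsize=(15, 5))
sns.barplot(x=data['workclass'], y=data['income'])
plt.xlabel('Working Class')
plt.ylabel('Likelihood of income > 50K')
plt.show()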

[Plot: likelihood of income >50K by workclass]

My first idea was to use one-hot encoding. However, columns like workclass, native.country, race, marital.status, and occupation are all unordered, so one-hot encoding all of them would create nearly 100 columns.
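To see where the column count comes from, here is a minimal sketch on a made-up toy frame (the rows and the `toy` name are assumptions, not the real data). One-hot encoding adds one column per distinct value, so the total can be estimated with `nunique()` before encoding anything.

```python
import pandas as pd

# Toy stand-in for the real data; the rows are illustrative only.
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp-inc"],
    "race": ["White", "Black", "White", "Asian-Pac-Islander"],
    "age": [39, 50, 38, 53],
})

categorical_cols = ["workclass", "race"]

# One new column per distinct value in each categorical column.
expected_new_cols = int(toy[categorical_cols].nunique().sum())

# pd.get_dummies replaces the listed columns with their indicator columns.
encoded = pd.get_dummies(toy, columns=categorical_cols)

print(expected_new_cols)
print(list(encoded.columns))
```

Running the same `nunique()` check on the real frame shows which columns drive the blow-up. Passing `drop_first=True` to `pd.get_dummies` removes one indicator per column if the redundancy is a concern.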

My next idea was to manually group them based on the probability of a certain column value having an income >$50K, picked based on plots like the one below

[Plot: likelihood of income >50K by marital.status]

Going by this, my inclination for each column would be:

Column	Feature engineering decision
workclass	Drop
marital.status	Group (married with spouse present = 1, not married/estranged = 0)
occupation	Group (white collar: exec, prof, tech, sales = 1; blue collar: all else = 0)
race	Unsure; only 5 categories, so could one-hot or group by white vs non-white?
relationship	Group (Husband or Wife = 1, no marital relationship = 0)
sex	One-hot, or Male = 1, Female = 0; unsure, need input
native.country	Many categories; I think grouping by US vs non-US makes the most sense
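The groupings in the table could be sketched as below. The toy rows and the set names (`MARRIED_PRESENT`, `WHITE_COLLAR`) are assumptions for illustration, and the category labels should be checked against the actual values in your copy of the data.

```python
import pandas as pd

# Toy rows; label spellings are assumed and may differ in your file.
toy = pd.DataFrame({
    "marital.status": ["Married-civ-spouse", "Never-married", "Divorced", "Married-AF-spouse"],
    "occupation": ["Exec-managerial", "Craft-repair", "Sales", "Other-service"],
    "native.country": ["United-States", "Mexico", "United-States", "India"],
    "sex": ["Male", "Female", "Female", "Male"],
})

# Hypothetical groupings mirroring the table above.
MARRIED_PRESENT = {"Married-civ-spouse", "Married-AF-spouse"}
WHITE_COLLAR = {"Exec-managerial", "Prof-specialty", "Tech-support", "Sales"}

grouped = pd.DataFrame({
    # 1 if the value belongs to the chosen group, 0 otherwise.
    "married_present": toy["marital.status"].isin(MARRIED_PRESENT).astype(int),
    "white_collar": toy["occupation"].isin(WHITE_COLLAR).astype(int),
    "from_us": (toy["native.country"] == "United-States").astype(int),
    "is_male": (toy["sex"] == "Male").astype(int),
})

print(grouped)
```

For a two-valued column such as sex, a single 0/1 flag carries the same information as a one-hot pair with one column dropped, so the choice between the two makes little difference there.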

Here is a link to the full Jupyter notebook with graphs for all categorical variables. So, can you help me decide whether this is the right way to engineer the features in my dataset?

Solution

Although Adult Census is a classic toy dataset, I can't recall all the details so my answer may not be as informative as you expect. Still, I love everything about your decisions except maybe this idea:

"My next idea was to manually group them based on the probability of a certain column value having an income >$50K, picked based on plots like the one below"

Since your goal is to build a model that predicts income status for unobserved individuals, tying your categories directly to the target will likely lead to overfitting: in test data, the distributions of categories can differ from those in the training data. That does not mean you should ignore the target classes altogether. Instead, try calculating the probabilities on subsamples of your category columns, or adding noise to the obtained probabilities. See the category_encoders package for inspiration. Good luck!
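One way to read "probabilities on subsamples plus noise" is out-of-fold target encoding. The sketch below is hand-rolled with pandas and NumPy rather than taken from category_encoders; the function name `oof_target_encode`, its parameters, and the toy rows are all assumptions for illustration. Each row's encoding is computed only from rows in the other folds, so a row's own target never leaks into its feature.

```python
import numpy as np
import pandas as pd


def oof_target_encode(categories, target, n_splits=5, noise_std=0.01, seed=0):
    """Out-of-fold mean target encoding with optional Gaussian noise."""
    rng = np.random.default_rng(seed)
    categories = categories.reset_index(drop=True)
    target = target.reset_index(drop=True)

    # Randomly assign every row to one of n_splits folds.
    fold_ids = rng.permutation(len(categories)) % n_splits

    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    for fold in range(n_splits):
        held_out = fold_ids == fold
        # Statistics come only from rows outside the held-out fold.
        fold_prior = target[~held_out].mean()
        means = target[~held_out].groupby(categories[~held_out]).mean()
        # Categories unseen in the other folds fall back to the fold prior.
        values = categories[held_out].map(means).fillna(fold_prior)
        encoded[held_out] = values.to_numpy()

    # Small noise discourages the model from memorising the encoding.
    return encoded + rng.normal(0.0, noise_std, size=len(encoded))


# Toy usage; the rows are made up.
toy = pd.DataFrame({
    "workclass": ["Private", "Private", "State-gov", "Private", "State-gov", "Self-emp-inc"],
    "income": [0, 1, 0, 1, 1, 0],
})
toy["workclass_te"] = oof_target_encode(toy["workclass"], toy["income"], n_splits=3)
print(toy)
```

For test data, the usual approach is to encode with category means computed on the whole training set, without noise. The fold count and noise scale are tuning choices rather than fixed values.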

P.S. Are workclass and education.num really identical?🤔