Note: I don't have 10 rep yet, so I can't post images.
I'm working with the Adult Census dataset (the goal is to predict which people have an annual income greater than $50K) for some ML practice, and I have a question about feature engineering.
Eight of the dataset's columns are categorical: workclass, marital.status, occupation, relationship, race, sex, native.country, and income. (I dropped education because the integer column education.num already encodes it.)
As a first step, I recoded income as 1 for >$50K/year and 0 for <=$50K/year.
data['income'] = data['income'].replace({'<=50K': 0, '>50K': 1})
However, I need some guidance on how to handle the other variables. Take the 'workclass' column, for example:
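import matplotlib.pyplot as plt
import seaborn as sns

# Bar height is the mean of the 0/1 income column, i.e. the share earning >50K
plt.figure(figsize=(15, 5))
sns.barplot(x='workclass', y='income', data=data)
plt.xlabel('Working Class')
plt.ylabel('Likelihood of income > 50K')
plt.show()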
My first idea was to use one-hot encoding. However, `workclass`, `native.country`, `race`, `marital.status`, and `occupation` are all unordered, so this would create nearly 100 columns.
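To see where a column count like that comes from, here is a minimal sketch on a toy frame. The `toy` DataFrame and its values are made up for illustration and are not the real Adult data; the point is only that one-hot encoding adds one column per distinct value, so the total is the sum of the per-column cardinalities.

```python
import pandas as pd

# Toy stand-in for the Adult data; these rows are illustrative only.
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp-inc"],
    "race": ["White", "Black", "White", "Asian-Pac-Islander"],
    "age": [39, 50, 38, 53],
})

categorical_cols = ["workclass", "race"]

# One-hot encoding creates one column per distinct value, so the number of
# new columns equals the sum of the per-column cardinalities.
expected_new_cols = int(toy[categorical_cols].nunique().sum())

encoded = pd.get_dummies(toy, columns=categorical_cols)
one_hot_cols = [c for c in encoded.columns if c not in toy.columns]

print(expected_new_cols, len(one_hot_cols))
```

Running `data[categorical_cols].nunique()` on the real frame would show which columns contribute most to the total, which is usually a good guide to where grouping pays off.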
My next idea was to manually group them based on the probability of a certain column value having an income >$50K, picked based on plots like the one below
Going by this, my inclination for each column would be:
Column | Feature Engineering Decision |
---|---|
workclass | Drop |
marital.status | Group (married + present = 1, not married/estranged = 0) |
occupation | Group (white-collar jobs (exec, prof, tech, sales) = 1, blue-collar (all else) = 0) |
race | Unsure; only 5 categories, so could one-hot or group by white vs. non-white? |
relationship | Group (Husband or Wife = 1, no marital relationship = 0) |
sex | One-hot, or Male = 1, Female = 0; unsure, need input |
native.country | Tons of categories; I think grouping by US vs. non-US makes the most sense |
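The groupings in the table could be sketched roughly as follows. The category strings (`Married-civ-spouse`, `Exec-managerial`, `United-States`, and so on) are assumed to match the usual Adult Census labels, and the feature names (`is_married`, `is_white_collar`, `is_us`, `is_male`) are hypothetical; check both against `data[col].unique()` before relying on them.

```python
import pandas as pd

# Toy rows; the label strings are assumed to follow the usual Adult spelling.
toy = pd.DataFrame({
    "marital.status": ["Married-civ-spouse", "Never-married", "Divorced", "Married-AF-spouse"],
    "occupation": ["Exec-managerial", "Craft-repair", "Sales", "Other-service"],
    "native.country": ["United-States", "Mexico", "United-States", "India"],
    "sex": ["Male", "Female", "Female", "Male"],
})

# Assumed membership of each group, following the table above.
MARRIED_PRESENT = {"Married-civ-spouse", "Married-AF-spouse"}
WHITE_COLLAR = {"Exec-managerial", "Prof-specialty", "Tech-support", "Sales"}

# Each binary feature is 1 when the row falls in the group, else 0.
features = pd.DataFrame({
    "is_married": toy["marital.status"].isin(MARRIED_PRESENT).astype(int),
    "is_white_collar": toy["occupation"].isin(WHITE_COLLAR).astype(int),
    "is_us": (toy["native.country"] == "United-States").astype(int),
    "is_male": (toy["sex"] == "Male").astype(int),
})

print(features)
```

For a two-valued column such as sex, a single 0/1 column carries the same information as a one-hot pair, so the two options in the table are equivalent in practice.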
Here is a link to the full Jupyter notebook with graphs for all categorical variables. So, can you help me decide whether this is the right way to engineer the features in my dataset?
Although Adult Census is a classic toy dataset, I can't recall all the details so my answer may not be as informative as you expect. Still, I love everything about your decisions except maybe this idea:
> My next idea was to manually group them based on the probability of a certain column value having an income >$50K, picked based on plots like the one below
Since your goal is to build a model that predicts income status for unseen individuals, tying your categories directly to the target will likely lead to overfitting: in test data, the distribution of categories can differ from that in the training data. That doesn't mean you should ignore the target entirely, though. Try computing the probabilities on subsamples of the training data, or adding noise to the probabilities you obtain. See the `category_encoders` package for inspiration. Good luck!
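One way to read "probabilities on subsamples" is out-of-fold target encoding: each row's category is replaced by the mean target computed only from rows in other folds, optionally with a little Gaussian noise. The sketch below uses plain pandas and NumPy rather than `category_encoders`; the helper name `oof_target_encode` and its parameters are hypothetical, and this is one possible interpretation, not necessarily what the package does internally.

```python
import numpy as np
import pandas as pd


def oof_target_encode(categories, target, n_splits=5, noise_std=0.0, seed=0):
    """Out-of-fold mean-target encoding (hypothetical helper, not a library API)."""
    rng = np.random.default_rng(seed)
    categories = categories.reset_index(drop=True)
    target = target.reset_index(drop=True)
    global_mean = target.mean()
    encoded = pd.Series(global_mean, index=categories.index, dtype=float)

    # Shuffle the row positions and cut them into roughly equal folds.
    folds = np.array_split(rng.permutation(len(categories)), n_splits)
    for holdout in folds:
        in_fold = categories.index.isin(holdout)
        # Category means come only from rows outside the held-out fold.
        means = target[~in_fold].groupby(categories[~in_fold]).mean()
        # Categories not seen outside the fold fall back to the global mean.
        values = categories[in_fold].map(means).fillna(global_mean)
        encoded.loc[in_fold] = values.to_numpy()

    if noise_std > 0:
        # Optional Gaussian noise to further weaken the link to the target.
        encoded = encoded + rng.normal(0.0, noise_std, size=len(encoded))
    return encoded


# Toy data; values are illustrative, not taken from the Adult dataset.
toy = pd.DataFrame({
    "workclass": ["Private"] * 6 + ["State-gov"] * 4,
    "income": [1, 0, 0, 1, 0, 0, 1, 1, 0, 1],
})
toy["workclass_te"] = oof_target_encode(toy["workclass"], toy["income"], n_splits=5)

# With as many folds as rows this becomes leave-one-out encoding.
loo = oof_target_encode(toy["workclass"], toy["income"], n_splits=len(toy))
print(toy)
```

Because a row never contributes to its own encoded value, the feature cannot simply memorize that row's label, which is the overfitting risk described above. At prediction time, new rows would be encoded with category means computed from the full training set.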
P.S. Are `workclass` and `education.num` really identical? 🤔