I am trying to prepare a training dataset that contains many high-cardinality categorical columns in order to train a machine learning model. I therefore want to target encode them so that the categorical columns become numerical columns. Label encoding is not suitable because the categorical features are not ordinal.
My training dataset looks like this (only 4 of the 20 columns are shown):
target | cat_col1 | cat_col2 | cat_col3 | cat_col4 |
---|---|---|---|---|
10 | city1 | james | 25-55 | abc |
20 | city2 | adam | 30-40 | bcc |
15 | city1 | charles | 30-40 | bcc |
I want to write efficient code that target encodes all the categorical columns without having to handle each column individually.
The resulting training dataframe should look like this:
target | cat_col1 | cat_col2 | cat_col3 | cat_col4 |
---|---|---|---|---|
10 | 15 | 10 | 10 | 10 |
20 | 20 | 20 | 17 | 17 |
15 | 15 | 15 | 17 | 17 |
I can get the above output by writing code for each column, but since I have 20 categorical columns, this does not seem efficient.
encoder = TargetEncoder()
train['cat_col1'] = encoder.fit_transform(train['cat_col1'], train['target'])
train['cat_col2'] = encoder.fit_transform(train['cat_col2'], train['target'])
train['cat_col3'] = encoder.fit_transform(train['cat_col3'], train['target'])
train['cat_col4'] = encoder.fit_transform(train['cat_col4'], train['target'])
In addition, I would like to take the target-encoded values learned from the train dataframe and use them to replace the corresponding categories in the test dataframe.
Assuming you're using the category_encoders implementation, it should accept several columns just fine, at least in recent versions:
cat_cols = ['cat_col1', 'cat_col2', 'cat_col3', 'cat_col4']
train[cat_cols] = encoder.fit_transform(train[cat_cols], train['target'])
test[cat_cols] = encoder.transform(test[cat_cols])
Alternatively, you could use a loop:
for column in cat_cols:
    encoder = TargetEncoder()
    train[column] = encoder.fit_transform(train[column], train['target'])
    test[column] = encoder.transform(test[column])
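If you only need plain per-category means, a pandas-only sketch along these lines may also work. It is not what category_encoders does internally: as far as I know, its TargetEncoder applies smoothing, so its values will generally differ from raw means. The `test` frame below is hypothetical, and filling categories unseen in train with the global train mean is an assumption you may want to change.

```python
import pandas as pd

# Training data from the question (4 of the 20 columns).
train = pd.DataFrame({
    "target": [10, 20, 15],
    "cat_col1": ["city1", "city2", "city1"],
    "cat_col2": ["james", "adam", "charles"],
    "cat_col3": ["25-55", "30-40", "30-40"],
    "cat_col4": ["abc", "bcc", "bcc"],
})

# Hypothetical test frame; "city3" never appears in train.
test = pd.DataFrame({
    "cat_col1": ["city2", "city3"],
    "cat_col2": ["adam", "james"],
    "cat_col3": ["30-40", "25-55"],
    "cat_col4": ["bcc", "abc"],
})

cat_cols = ["cat_col1", "cat_col2", "cat_col3", "cat_col4"]

# Assumption: categories unseen in train fall back to the global train mean.
global_mean = train["target"].mean()

# Learn one category -> mean(target) mapping per column, from train only.
mappings = {col: train.groupby(col)["target"].mean() for col in cat_cols}

for col in cat_cols:
    # Replace each category with the mean target learned on train.
    train[col] = train[col].map(mappings[col])
    # Reuse the train mapping on test; unseen categories become NaN, then the fallback.
    test[col] = test[col].map(mappings[col]).fillna(global_mean)
```

Because the mappings are built before any column is overwritten, the same dictionary can be kept around and applied to any later dataframe that has the same columns. Note that raw means computed on the full training set leak the target into the features, which is one reason library encoders add smoothing.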