machine-learning, encoding, deep-learning, target, categorical-data

How to use target encoding: expanding mean on the test set


The expanding mean is a way to prevent overfitting when performing target encoding. What I do not understand is how to use this technique to fit on the train set and then transform the test set. The encoding is dynamic: the encoded value for a given feature level varies from row to row, because it depends on a cumulative sum.

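# Sum of the target over the earlier rows of the same category (current row excluded)
cumulative_sum = training.groupby(column)["target"].cumsum() - training["target"]
# Number of earlier rows of the same category
cumulative_count = training.groupby(column).cumcount()
# Expanding mean (0/0 = NaN on a category's first row); train_new must share training's index
train_new[column + "_mean_target"] = cumulative_sum / cumulative_count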

Solution

  • Shouldn't you simply compute the mean of the target for each category on the training set and map those means onto the corresponding categories in your test set? The cumulative means are needed only on the training set, for regularization purposes.
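
A minimal sketch of that idea, on toy data. The column name "city", the toy values, and the fallback to the global training mean (for a category's first training row and for test categories never seen in training) are assumptions made for illustration; they are not part of the question or the answer.

```python
import pandas as pd

# Toy data; the column name "city" and the values are illustrative assumptions.
training = pd.DataFrame({
    "city":   ["a", "b", "a", "a", "b"],
    "target": [1, 0, 0, 1, 1],
})
test = pd.DataFrame({"city": ["a", "b", "c"]})  # "c" never appears in training

column = "city"
# Assumption: fall back to the global training mean where no estimate exists.
global_mean = training["target"].mean()

# Training rows: expanding mean over the earlier rows of the same category.
grouped = training.groupby(column)["target"]
cumulative_sum = grouped.cumsum() - training["target"]
cumulative_count = grouped.cumcount()
# The first row of each category is 0/0 = NaN, so it gets the fallback value.
train_encoded = (cumulative_sum / cumulative_count).fillna(global_mean)

# Test rows: one fixed value per category, the mean over all training rows.
category_means = training.groupby(column)["target"].mean()
test_encoded = test[column].map(category_means).fillna(global_mean)

print(train_encoded.tolist())
print(test_encoded.tolist())
```

In this sketch the "fit" step amounts to storing category_means and global_mean, and the "transform" step is the map call. The row-by-row expanding mean is used only to build the training features.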