Context: When preprocessing a data set with sklearn, you call fit_transform on the training set and transform on the test set to avoid data leakage. With leave-one-out (LOO) encoding, you need the target value to calculate the encoded value of a feature value. When using the LOO encoder in a pipeline, you can apply it to the training set via the fit_transform method, which accepts both the features (X) and the target values (y).
How do I calculate the LOO encodings for the test set with the same pipeline, given that transform does not accept the target values as an argument? I'm quite confused about this. The transform function does transform the columns, but without considering the target value, since it doesn't have that information.
You shouldn't need the target variable of the test set when applying leave-one-out (or any other) encoding. Even if you somehow managed to pass it during your offline evaluation on the test set, how would you apply it at inference time? When your model is serving traffic from real users, the true label obviously isn't available, and you should always compute your test metrics so that they are representative of what happens in the real world. So, conceptually, it is wrong to use the test labels for feature encoding.
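Concretely, this is also how a pipeline behaves: fit forwards the target to the encoder, but at prediction time only the features are passed through. A minimal sketch (the data, step names and classifier are just placeholders):
import pandas as pd
import category_encoders as ce
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
# toy training data: one categorical feature and a binary target
X_train = pd.DataFrame({'f1': ['P', 'Q', 'P', 'Q', 'P', 'Q']})
y_train = pd.Series([1, 0, 1, 0, 1, 0])
pipe = Pipeline([
    ('enc', ce.LeaveOneOutEncoder(cols=['f1'])),
    ('clf', LogisticRegression()),
])
# Pipeline.fit forwards y to each step, so the encoder's fit_transform
# receives both X_train and y_train and can leave each row's target out
pipe.fit(X_train, y_train)
# at inference time there is no label: the pipeline only calls the
# encoder's transform(X_new) before handing the result to the classifier
X_new = pd.DataFrame({'f1': ['Q', 'P']})
print(pipe.predict(X_new))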
I looked up the source code of leave-one-out encoding in the category_encoders package, and it's apparent that, when the target variable is not supplied, it just uses the per-level mean target without leaving the current example out:
# Replace level with its mean target; if level occurs only once, use global mean
level_means = (colmap['sum'] / colmap['count']).where(level_notunique, self._mean)
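So without the target, transform simply fills in sum / count per level (the plain training mean, or the global mean for levels that occur only once), whereas with the target available during fit_transform the current row's y is left out of the mean. Roughly like this (an illustrative re-implementation, not the library's exact code):
import pandas as pd
train = pd.DataFrame({'f3': ['A', 'A', 'C'], 'y': [1, 0, 1]})
stats = train.groupby('f3')['y'].agg(['sum', 'count'])
global_mean = train['y'].mean()
# with the target (training rows): leave the row's own y out of the mean,
# falling back to the global mean for levels that occur only once
loo_train = ((train['f3'].map(stats['sum']) - train['y'])
             / (train['f3'].map(stats['count']) - 1)).fillna(global_mean)
# without the target (test rows): the plain per-level training mean
test = pd.DataFrame({'f3': ['A', 'C']})
plain_test = test['f3'].map(stats['sum'] / stats['count'])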
So if I just use the encoder like this:
import category_encoders as ce
from sklearn.model_selection import train_test_split
import pandas as pd
dataframe = pd.DataFrame({
'f1': ['P', 'Q', 'P', 'Q', 'P', 'P', 'Q', 'Q'],
'f2': ['M', 'N', 'M', 'N', 'M', 'N', 'M', 'N'],
'f3': ['A', 'B', 'C', 'C', 'C', 'C', 'A', 'C'],
'y': [1, 0, 1, 0, 1, 1, 0, 0]
})
train_data, test_data = train_test_split(dataframe, test_size=0.2)
encoder = ce.LeaveOneOutEncoder(cols=['f1', 'f2', 'f3'])
# fit_transform receives the target, so each training row is encoded with the
# mean target of its level computed over the other training rows
encoded_train = encoder.fit_transform(train_data, train_data['y'])
# transform receives no target, so each test row just gets the per-level mean
# (or the global mean for levels seen only once) learned from the training set
encoded_test = encoder.transform(test_data)
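then the test-set encodings should simply be the per-level target means of the training split (with the global training mean for levels that occur only once there, per the source comment above). That is easy to check, e.g. for f3:
# transform on the test set should reproduce these values for column f3
level_means_f3 = train_data.groupby('f3')['y'].mean()  # per-level training means
global_mean = train_data['y'].mean()                    # fallback for singleton levels
print(encoded_test['f3'])
print(level_means_f3)
print(global_mean)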