Tags: scikit-learn, data-science, one-hot-encoding

How to preprocess test data after one hot encoding


I am a bit confused here. I have one-hot encoded the categorical columns with fewer than 10 unique values (low_cardinality_cols) and dropped the remaining categorical columns, for both the training and the validation data.

Now I want to apply my model to new data in test.csv. What would be the best method for preprocessing the test data so that it matches the train/validation format?

My concerns are:
1. The columns in test.csv will almost certainly have different cardinality from the training columns.
2. If I one-hot encode the test data using the low-cardinality columns from training, I get "Input contains NaN", even though my train, validation, and test data all have the same number of columns.

Sample one-hot encoding code is below. This is for the Kaggle competition in the intermediate course.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
# (in scikit-learn >= 1.2 the `sparse` argument is named `sparse_output`)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove all categorical columns (the low-cardinality ones are replaced by their one-hot encoding)
# Dropping object_cols also discards the remaining categorical columns that were not encoded

num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

Solution

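  • I would advise two things:

    • OneHotEncoder has a parameter handle_unknown, which defaults to "error". Set it to handle_unknown="ignore" for the case you describe (categories in the test data that were not seen during training), so that an unseen category is encoded as all zeros instead of raising an error.
    • Use a scikit-learn pipeline that includes your predictor, instead of calling fit_transform and transform yourself and then passing the data to the predictor. The encoder fitted on the training data is then applied unchanged to the validation and test data.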
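
Both suggestions can be combined in a minimal sketch. Everything in the toy data is an assumption made for illustration: the column names color, id_code and size, the values, and the choice of RandomForestRegressor as predictor are hypothetical and not taken from the question. The idea is that the pipeline is fitted once on the training data and then called directly on the raw test data, so the test set is never encoded separately.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one low-cardinality categorical column,
# one high-cardinality categorical column, and one numerical column.
X_train = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "id_code": ["a1", "b2", "c3", "d4"],
    "size": [1.0, 2.0, 3.0, 4.0],
})
y_train = [10.0, 20.0, 30.0, 40.0]

# The test data contains "purple", a category never seen during training.
X_test = pd.DataFrame({
    "color": ["blue", "purple"],
    "id_code": ["e5", "f6"],
    "size": [2.5, 3.5],
})

low_cardinality_cols = ["color"]
numerical_cols = ["size"]

# Columns not listed in any transformer (here id_code) are dropped,
# because the default is remainder="drop".
preprocessor = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), low_cardinality_cols),
        ("num", "passthrough", numerical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("regressor", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# Fit on the training data only; the encoder learns its categories here.
model.fit(X_train, y_train)

# Predict on raw test data; the fitted encoder is reused, not refitted.
preds = model.predict(X_test)

# Inspect the encoded test rows: the unseen "purple" row has all-zero
# one-hot columns, followed by the passed-through numerical column.
encoded_test = model.named_steps["preprocess"].transform(X_test)
print(encoded_test)
```

Because the encoder is fitted only inside model.fit, the test data always ends up with the same columns as the training data, whatever categories it contains.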