machine-learning, regression, one-hot-encoding, data-processing, imputation

How to apply same processing pipeline for train and test data when they result in different final features


I'm trying to create a regression model to predict some housing sales, and I am facing an issue with processing the train data and the test data the same way (the test data here is a separate set, not validation data split off from the training set). The processing steps I'm performing are as follows:

  1. Drop the columns with more than 50% null values
  2. Impute the rest of the columns containing null values
  3. One-hot encode the categorical columns
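The three steps above can be sketched on a single frame as follows. This is only a minimal illustration using pandas; the toy data, the choice of median and most-frequent imputation, and the variable names are assumptions, not something stated in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy training frame; column names and values are made up.
train = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0, 4.0],
    "col3": [np.nan, np.nan, np.nan, 1.0],  # 75% null -> dropped
    "col8": [10.0, np.nan, 30.0, 50.0],     # 25% null -> imputed (numeric)
    "col9": ["a", "b", None, "a"],          # 25% null -> imputed (categorical)
})

# Step 1: drop columns whose fraction of nulls exceeds 50%.
null_frac = train.isna().mean()
cols_to_drop = null_frac[null_frac > 0.5].index.tolist()
train = train.drop(columns=cols_to_drop)

# Step 2: impute what is left (assumed strategies: median for numeric
# columns, most frequent value for categorical columns).
num_cols = train.select_dtypes(include="number").columns
cat_cols = train.columns.difference(num_cols)
train[num_cols] = train[num_cols].fillna(train[num_cols].median())
for c in cat_cols:
    train[c] = train[c].fillna(train[c].mode().iloc[0])

# Step 3: one-hot encode the categorical columns.
encoded = pd.get_dummies(train, columns=list(cat_cols))
```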

Say my train data has the following columns (after label extraction) (the ones in ** ** contain null values):

['col1', 'col2', '**col3**', 'col4', '**col5**', 'col6', '**col7**','**col8**', '**col9**', '**col10**', 'col11']

test data has the following columns:

['col1', '**col2**', 'col3', 'col4', 'col5', 'col6', '**col7**', '**col8**', '**col9**', '**col10**', 'col11']

I drop only the columns with more than 50% null values and impute the rest of the marked columns. Say, in the train data, I will have:

cols_to_drop= ['**col3**','**col5**','**col7**' ]
cols_to_impute= ['**col8**', '**col9**','**col10**' ]

And if I retain the same columns to be dropped from test data too, my test data will have the following:

cols_to_drop= ['**col3**','**col5**','**col7**' ]
cols_to_impute= ['**col2**', '**col8**', '**col9**','**col10**' ]

The problem now comes with imputation: I have to .fit_transform my imputer on the cols_to_impute of the train data and then .transform the cols_to_impute of the test data with the same imputer, but the two cols_to_impute lists clearly contain a different number of features. (I tried this anyway and had issues with imputation.)
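The mismatch can be reproduced with a small sketch. It assumes the imputer is scikit-learn's SimpleImputer (the question does not name the class) and uses made-up numeric values; only the column names follow the question.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; only the column names follow the question.
train = pd.DataFrame({"col2": [1.0, 2.0, 3.0],
                      "col8": [np.nan, 5.0, 7.0],
                      "col9": [1.0, np.nan, 3.0]})
test = pd.DataFrame({"col2": [np.nan, 4.0],
                     "col8": [6.0, np.nan],
                     "col9": [np.nan, 2.0]})

train_cols_to_impute = ["col8", "col9"]         # nulls seen in train
test_cols_to_impute = ["col2", "col8", "col9"]  # col2 has nulls only in test

imputer = SimpleImputer(strategy="median")
imputer.fit_transform(train[train_cols_to_impute])  # fitted on 2 features

try:
    imputer.transform(test[test_cols_to_impute])    # 3 features supplied
    mismatch_error = None
except ValueError as exc:
    # scikit-learn rejects input whose features differ from those seen in fit
    mismatch_error = exc
```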

If I instead keep the same cols_to_impute for both train and test, ignoring the nulls in **col2** of the test data, one-hot encoding fails with an error saying NaNs need to be handled before encoding. So, how should the processing be done for train and test sets in such cases? Should I concatenate both of them, perform the processing, and split them again afterwards? I have read that doing this can cause leakage issues.


Solution

  • Well, you should do the following:

    1. Combine the train and test dataframes, then do the first two steps on the combined frame, i.e. drop the columns with more than 50% nulls and impute the remaining ones.
    2. Then split it back into train and test, and do the one-hot encoding.

    This ensures that both dataframes end up with the same columns, and there is no leakage in doing the one-hot encoding.
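A rough sketch of these two steps with pandas might look like the following. The toy frames, the median and most-frequent imputation, and the final `reindex` alignment are assumptions added for illustration; the answer does not specify them. In this sketch the imputation statistics are computed over the pooled rows, which is the leakage concern raised in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; names and values are made up for illustration.
train = pd.DataFrame({"col2": [1.0, 2.0, 3.0],
                      "col3": [np.nan, np.nan, 1.0],
                      "col9": ["a", "b", None]})
test = pd.DataFrame({"col2": [np.nan, 4.0],
                     "col3": [2.0, np.nan],
                     "col9": ["a", "c"]})

# Step 1: combine, then drop >50% null columns and impute the rest.
combined = pd.concat([train, test], keys=["train", "test"])
null_frac = combined.isna().mean()
combined = combined.drop(columns=null_frac[null_frac > 0.5].index)

num_cols = combined.select_dtypes(include="number").columns
cat_cols = combined.columns.difference(num_cols)
combined[num_cols] = combined[num_cols].fillna(combined[num_cols].median())
for c in cat_cols:
    combined[c] = combined[c].fillna(combined[c].mode().iloc[0])

# Step 2: split back into train and test, then one-hot encode each part.
train_clean = combined.loc["train"]
test_clean = combined.loc["test"]
train_enc = pd.get_dummies(train_clean, columns=list(cat_cols))
# Assumption beyond the answer: align test to the train dummy columns,
# since a category can appear in only one of the two frames.
test_enc = pd.get_dummies(test_clean, columns=list(cat_cols)).reindex(
    columns=train_enc.columns, fill_value=0)
```

The `reindex` call is there because encoding the two parts separately can otherwise yield different dummy columns when a category (here the hypothetical value "c") occurs in only one of them.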