python-3.x pandas scikit-learn sklearn-pandas

How to apply the pandas get_dummies function to a validation data set?


I tried to apply the pandas get_dummies function to my dataset. The problem is that the number of category values does not match between the train set and the valid set. For example, a column in the train set has 5 distinct values, e.g. [1, 2, 3, 4, 5], but the same column in the valid set has only 3, e.g. [1, 3, 5].

When I built the model using the train dataset, 5 dummies were created, e.g. dum_1, dum_2, dum_3, dum_4, dum_5.

So if I just use the same function on the valid dataset, only 3 dummies will be created, e.g. dum_1, dum_3, dum_5.

As a result, I cannot use my model to predict on the valid dataset. How can I create the same dummies for the train and valid sets? (Concatenating the two datasets is not possible, so please suggest a method other than pd.concat.)

Also, if I add the missing columns to the valid set myself, I expect the result to be wrong, because the order of the dummy columns will not match between the train and valid sets.

Thanks.


Solution

  • All you need to do is the following (train and valid here are the DataFrames returned by get_dummies):

    1. Create the columns that are present in the training data but missing from the validation data, filled with 0.
    missing_cols = [col for col in train.columns if col not in valid.columns]
    for col in missing_cols:
        valid[col] = 0
    
    2. These new columns are appended at the end, so the column order no longer matches the training data. Reorder the columns to match the training data (this also drops any dummy column that appears only in the validation data):
    valid = valid[train.columns]
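
The two steps above can be tried end to end on the example from the question. This is only a minimal sketch: the column name "dum" and the toy values are assumptions made up for illustration, and the reindex call at the end is an alternative that is not part of the original answer. It should produce the same column set and order in a single step.

```python
import pandas as pd

# Hypothetical toy data mirroring the question: the column name "dum"
# and the values are assumptions chosen for illustration only.
train_raw = pd.DataFrame({"dum": [1, 2, 3, 4, 5]})
valid_raw = pd.DataFrame({"dum": [1, 3, 5]})

# Encode each set on its own (no pd.concat). The validation set only
# gets the dummy columns for the values it actually contains.
train = pd.get_dummies(train_raw, columns=["dum"])
valid = pd.get_dummies(valid_raw, columns=["dum"])

# Step 1: add the dummy columns that exist in train but not in valid.
missing_cols = [col for col in train.columns if col not in valid.columns]
for col in missing_cols:
    valid[col] = 0

# Step 2: put the columns in the same order as in train.
# Note the single pair of brackets around train.columns.
valid = valid[train.columns]

# Alternative (not from the original answer): reindex adds the missing
# columns filled with 0, drops extra ones, and reorders in one call.
valid_alt = pd.get_dummies(valid_raw, columns=["dum"]).reindex(
    columns=train.columns, fill_value=0
)

print(list(valid.columns))
print(list(valid_alt.columns))
```

Either way, the column list is taken from the training frame, so it is worth saving train.columns alongside the model if the validation data is encoded in a separate run.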