Tags: python, machine-learning, one-hot-encoding, train-test-split

Do I have to do one-hot-encoding separately for train and test dataset?


I'm working on a classification problem and I've split my data into train and test sets.

I have a few categorical columns (around 4-6) and I am thinking of using pd.get_dummies to one-hot encode them.

My question is: do I have to do the one-hot encoding separately for the train and test splits? If so, I guess I'd better use sklearn's OneHotEncoder, because it supports fit and transform methods, right?


Solution

  • Generally, you want to treat the test set as though you did not have it during training. Whatever transformations you apply to the train set should also be applied to the test set before you make predictions. So yes, you should transform the two sets separately, but apply the same transformation to both: fit it on the train set only, then use that fitted transformation on the test set.

    For example, if the test set is missing one of the categories, there should still be a dummy variable for the missing category (which would be found in the training set), since the model you train will still expect that dummy variable. If the test set has an extra category, this should probably be handled with some "other" category.

    Similarly, when scaling continuous variables, say to [0,1], you use the range of the train set when scaling the test set. This could mean that a scaled test value falls outside of [0,1].
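
As a rough illustration of the scaling point, here is a minimal sketch using sklearn's MinMaxScaler. The toy values are made up for this example; the scaler is fitted on the train set only, so a test value above the train maximum ends up greater than 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values, chosen only for illustration).
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

scaler = MinMaxScaler()   # default feature_range is (0, 1)
scaler.fit(train)         # learns min=10 and max=30 from the train set only

# Both sets are transformed with the range learned from train.
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

print(np.round(train_scaled.ravel(), 3).tolist())  # [0.0, 0.5, 1.0]
print(np.round(test_scaled.ravel(), 3).tolist())   # [0.25, 1.5]
```

The test value 40 maps to 1.5 because it lies beyond the train maximum of 30. That is expected here, not a sign that the scaler should be refitted on the test set.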


    For completeness, here's how the one-hot encoding might look:

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    
    ### Correct: fit the encoder on the train set only
    train = pd.DataFrame(['A', 'B', 'A', 'C'])
    test = pd.DataFrame(['B', 'A', 'D'])
    
    enc = OneHotEncoder(handle_unknown='ignore')
    enc.fit(train)
    
    enc.transform(train).toarray()
    #array([[1., 0., 0.],
    #       [0., 1., 0.],
    #       [1., 0., 0.],
    #       [0., 0., 1.]])
    
    enc.transform(test).toarray()
    #array([[0., 1., 0.],
    #       [1., 0., 0.],
    #       [0., 0., 0.]])
    
    
    ### Incorrect: fit the encoder on train and test combined
    full = pd.concat((train, test))
    
    enc = OneHotEncoder(handle_unknown='ignore')
    enc.fit(full)
    
    enc.transform(train).toarray()
    #array([[1., 0., 0., 0.],
    #       [0., 1., 0., 0.],
    #       [1., 0., 0., 0.],
    #       [0., 0., 1., 0.]])
    
    enc.transform(test).toarray()
    #array([[0., 1., 0., 0.],
    #       [1., 0., 0., 0.],
    #       [0., 0., 0., 1.]])
    

    Notice that for the incorrect approach there is an extra column for D (which only shows up in the test set). During training, we wouldn't know about D at all, so there shouldn't be a column for it. In the correct approach, handle_unknown='ignore' encodes the unseen D as a row of all zeros.
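
Since the question mentions pd.get_dummies, here is one possible way to get similar behaviour with pandas alone. This is only a sketch: the column name `cat` is a hypothetical placeholder, and the idea is to encode each set and then reindex the test columns to match the train columns.

```python
import pandas as pd

# Same toy categories as above, in a column with a hypothetical name "cat".
train = pd.DataFrame({"cat": ["A", "B", "A", "C"]})
test = pd.DataFrame({"cat": ["B", "A", "D"]})

train_dummies = pd.get_dummies(train, columns=["cat"])
test_dummies = pd.get_dummies(test, columns=["cat"])

# Align the test columns to the train columns:
# cat_C (missing from test) is added and filled with 0,
# cat_D (unseen during training) is dropped.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns))                  # ['cat_A', 'cat_B', 'cat_C']
print(test_aligned.astype(int).to_numpy().tolist())  # [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
```

This works for a one-off split, but the train columns have to be stored somewhere to encode new data later. A fitted OneHotEncoder keeps that information itself, which is one reason it tends to be the more convenient choice.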