Tags: python, pandas, encoding, scikit-learn, one-hot-encoding

How to use LabelBinarizer to one hot encode both train and test correctly


Suppose I have a train set like this:

Name   | day
------------
First  |  0
Second |  1
Third  |  1
Forth  |  2

And a test set that does not contain all these names or days. Like so:

Name   | day
------------
First  |  2
Second |  1
Forth  |  0

I have the following code to transform these columns into encoded features:

import pandas as pd
from sklearn.preprocessing import LabelBinarizer

features_to_encode = ['Name', 'day']
label_final = pd.DataFrame()

for feature in features_to_encode:
    label_campaign = LabelBinarizer()
    label_results = label_campaign.fit_transform(df[feature])  # df holds the raw train data
    label_results = pd.DataFrame(label_results, columns=label_campaign.classes_)
    label_final = pd.concat([label_final, label_results], axis=1)

df_encoded = label_final.join(df)

To produce the following output on train (this works fine):

First  | Second  | Third  | Forth | 0 | 1 | 2 | 
-----------------------------------------------
  1    |    0    |   0    |   0   | 1 | 0 | 0 |
  0    |    1    |   0    |   0   | 0 | 1 | 0 |
  0    |    0    |   1    |   0   | 0 | 1 | 0 |
  0    |    0    |   0    |   1   | 0 | 0 | 1 |

However, when I run this on test data (new data), I get mismatched features whenever the test data does not contain exactly the same Names and days as the train data. So if I run similar code on this test sample, I would get:

First  | Second  | Forth | 0 | 1 | 2 | 
--------------------------------------
  1    |    0    |   0   | 0 | 0 | 1 |
  0    |    1    |   0   | 0 | 1 | 0 |
  0    |    0    |   1   | 1 | 0 | 0 |

What can I do to preserve the same transformation from train data and apply it correctly to test data, resulting in this desired output:

First  | Second  | Third  | Forth | 0 | 1 | 2 | 
-----------------------------------------------
  1    |    0    |   0    |   0   | 0 | 0 | 1 |
  0    |    1    |   0    |   0   | 0 | 1 | 0 |
  0    |    0    |   0    |   1   | 1 | 0 | 0 |

I have already tried adding a dict to capture the fit_transform results, but I am not sure whether this works or what to do with it afterwards:

features_to_encode = ['Name', 'day']
label_final = pd.DataFrame()

labels = {}                                                                      # <-- TRIED THIS
for feature in features_to_encode:
    label_campaign = LabelBinarizer()
    label_results = label_campaign.fit_transform(df[feature])
    labels[feature] = label_results                                              # <-- WITH THIS
    label_results = pd.DataFrame(label_results, columns=label_campaign.classes_)
    label_final = pd.concat([label_final, label_results], axis=1)

df_encoded = label_final.join(df)
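
For example, something along these lines is what I had in mind, although I have not verified it: store the fitted binarizers themselves (rather than the transformed arrays) and reuse them on the test frame. The df_train / df_test names below are just placeholders for my actual frames, and this assumes every value in test also appears in train:

import pandas as pd
from sklearn.preprocessing import LabelBinarizer

features_to_encode = ['Name', 'day']
binarizers = {}
train_encoded = pd.DataFrame()

for feature in features_to_encode:
    lb = LabelBinarizer()
    encoded = lb.fit_transform(df_train[feature])    # fit on train only
    binarizers[feature] = lb                         # keep the fitted binarizer, not the array
    train_encoded = pd.concat([train_encoded, pd.DataFrame(encoded, columns=lb.classes_)], axis=1)

# Reuse the binarizers fitted on train so test gets the same columns in the same order
test_encoded = pd.DataFrame()
for feature in features_to_encode:
    lb = binarizers[feature]
    encoded = lb.transform(df_test[feature])
    test_encoded = pd.concat([test_encoded, pd.DataFrame(encoded, columns=lb.classes_)], axis=1)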

Any help is appreciated. Thanks =)


Solution

  • Another approach, which may be better suited if some values are shared among different variables, or if you plan to automate the encoding of several columns (a short follow-up sketch for the training frame is shown after the output below):

    import pandas as pd

    df_train = pd.DataFrame({'Name': ['First', 'Second', 'Third', 'Fourth'], 'Day': [0, 1, 1, 2]})
    df_test = pd.DataFrame({'Name': ['First', 'Second', 'Fourth'], 'Day': [2, 1, 0]})
    categories = []

    cols_to_encode = ['Name', 'Day']
    # Union of all values in both the training and testing datasets:
    for col in cols_to_encode:
        categories.append(list(set(df_train[col].unique().tolist() + df_test[col].unique().tolist())))

    # Sort the class names under each variable
    for cat in categories:
        cat.sort()

    # Fix the category list of each test column, then one-hot encode with get_dummies
    for col_name, cat in zip(cols_to_encode, categories):
        df_test[col_name] = pd.Categorical(df_test[col_name], categories=cat)
    df_test = pd.get_dummies(df_test, columns=cols_to_encode)
    
    df_test
    
    >>
    
        Name_First  Name_Fourth  Name_Second  Name_Third  Day_0  Day_1  Day_2
    0   1           0            0            0           0      0      1
    1   0           0            1            0           0      1      0
    2   0           1            0            0           1      0      0
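
    If the training frame should end up with exactly the same dummy columns, the same Categorical step can be applied to df_train before calling pd.get_dummies. A small sketch, reusing the categories list built above:

    # Apply the same fixed category lists to the training frame,
    # so train and test end up with identical dummy columns.
    for col_name, cat in zip(cols_to_encode, categories):
        df_train[col_name] = pd.Categorical(df_train[col_name], categories=cat)
    df_train = pd.get_dummies(df_train, columns=cols_to_encode)

    # df_train.columns and df_test.columns now match column for column.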