Suppose I have a train set like this:
Name | day
------------
First | 0
Second | 1
Third | 1
Forth | 2
And a test set that does not contain all of these names or days, like so:
Name | day
------------
First | 2
Second | 1
Forth | 0
I have the following code to transform these columns into encoded features:
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

features_to_encode = ['Name', 'day']
label_final = pd.DataFrame()
for feature in features_to_encode:
    label_campaign = LabelBinarizer()
    label_results = label_campaign.fit_transform(df[feature])
    label_results = pd.DataFrame(label_results, columns=label_campaign.classes_)
    label_final = pd.concat([label_final, label_results], axis=1)
df_encoded = label_final.join(df)
This produces the following output on the train set (this works fine):
First | Second | Third | Forth | 0 | 1 | 2 |
-----------------------------------------------
1 | 0 | 0 | 0 | 1 | 0 | 0 |
0 | 1 | 0 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 0 | 1 | 0 | 0 | 1 |
However, when I run this on test data (new data), the features no longer match unless the test data contains exactly the same Names and days as the train data. So if I run the same code on the test sample above, I get:
First | Second | Forth | 0 | 1 | 2 |
--------------------------------------
1 | 0 | 0 | 0 | 0 | 1 |
0 | 1 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 1 | 0 | 0 |
What can I do to preserve the same transformation from train data and apply it correctly to test data, resulting in this desired output:
First | Second | Third | Forth | 0 | 1 | 2 |
-----------------------------------------------
1 | 0 | 0 | 0 | 0 | 0 | 1 |
0 | 1 | 0 | 0 | 0 | 1 | 0 |
0 | 0 | 0 | 1 | 1 | 0 | 0 |
I have already tried adding a dict to catch the fit_transform results, but I am not sure if this works or what to do with it afterwards:
features_to_encode = ['Name', 'day']
label_final = pd.DataFrame()
labels = {}  # <-- TRIED THIS
for feature in features_to_encode:
    label_campaign = LabelBinarizer()
    label_results = label_campaign.fit_transform(df[feature])
    labels[feature] = label_results  # <-- WITH THIS
    label_results = pd.DataFrame(label_results, columns=label_campaign.classes_)
    label_final = pd.concat([label_final, label_results], axis=1)
df_encoded = label_final.join(df)
Any help is appreciated. Thanks =)
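One way to keep the train-time encoding is to store the fitted LabelBinarizer objects themselves (rather than the arrays returned by fit_transform) and call only transform on new data. Below is a minimal sketch under that assumption. The train and test frames are rebuilt from the tables in the question, and the names encoders and encode are illustrative, not part of scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Data rebuilt from the question's tables (spelling kept as posted)
train = pd.DataFrame({"Name": ["First", "Second", "Third", "Forth"],
                      "day": [0, 1, 1, 2]})
test = pd.DataFrame({"Name": ["First", "Second", "Forth"],
                     "day": [2, 1, 0]})

features_to_encode = ["Name", "day"]

# Fit one binarizer per column on the training data and keep the fitted object
encoders = {feature: LabelBinarizer().fit(train[feature])
            for feature in features_to_encode}


def encode(df, encoders):
    """Apply already-fitted binarizers to df without refitting them."""
    parts = []
    for feature, encoder in encoders.items():
        values = encoder.transform(df[feature])
        # classes_ holds the labels seen during fit, in sorted order
        parts.append(pd.DataFrame(values, columns=encoder.classes_, index=df.index))
    return pd.concat(parts, axis=1)


train_encoded = encode(train, encoders)
test_encoded = encode(test, encoders)  # same 7 columns as train_encoded
```

Two caveats with this sketch: classes_ is sorted, so the Name columns come out as First, Forth, Second, Third rather than in order of appearance, and LabelBinarizer returns a single column for a feature with only two classes, so the columns=encoder.classes_ step assumes at least three distinct values per feature. If two features share values, the column names would also collide unless you prefix them with the feature name.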
Another approach, which may be better suited if the same values appear in more than one variable, or if you plan to automate the encoding of several columns:
import pandas as pd

df_train = pd.DataFrame({'Name': ['First', 'Second', 'Third', 'Fourth'], 'Day': [2, 1, 1, 2]})
df_test = pd.DataFrame({'Name': ['First', 'Second', 'Fourth'], 'Day': [2, 1, 0]})
categories = []
cols_to_encode = ['Name', 'Day']
# Union of all values in both training and testing datasets:
for col in cols_to_encode:
    categories.append(list(set(df_train[col].unique().tolist() + df_test[col].unique().tolist())))
# Sort the class names under each variable so the column order is deterministic
for cat in categories:
    cat.sort()
# Declare the full category list on each column so get_dummies emits every class
for col_name, cat in zip(cols_to_encode, categories):
    df_test[col_name] = pd.Categorical(df_test[col_name], categories=cat)
# dtype=int keeps the 0/1 output; pandas >= 2.0 defaults to bool
df_test = pd.get_dummies(df_test, columns=cols_to_encode, dtype=int)
df_test
>>
   Name_First  Name_Fourth  Name_Second  Name_Third  Day_0  Day_1  Day_2
0           1            0            0           0      0      0      1
1           0            0            1           0      0      1      0
2           0            1            0           0      1      0      0
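The snippet above only encodes df_test. To get matching columns, the training frame has to go through the same category lists. Here is a minimal sketch of that, assuming the same two frames; one_hot is a hypothetical helper name, and the category lists are kept in a dict rather than a list purely for convenience.

```python
import pandas as pd

df_train = pd.DataFrame({"Name": ["First", "Second", "Third", "Fourth"],
                         "Day": [2, 1, 1, 2]})
df_test = pd.DataFrame({"Name": ["First", "Second", "Fourth"],
                        "Day": [2, 1, 0]})
cols_to_encode = ["Name", "Day"]

# Sorted union of the values seen in either frame, one list per column
categories = {col: sorted(set(df_train[col]) | set(df_test[col]))
              for col in cols_to_encode}


def one_hot(df, categories):
    """One-hot encode df using fixed category lists (hypothetical helper)."""
    out = df.copy()  # leave the caller's frame untouched
    for col, cats in categories.items():
        out[col] = pd.Categorical(out[col], categories=cats)
    # dtype=int gives 0/1 columns regardless of the pandas version's default
    return pd.get_dummies(out, columns=list(categories), dtype=int)


train_encoded = one_hot(df_train, categories)
test_encoded = one_hot(df_test, categories)
```

Both frames then share the columns Name_First, Name_Fourth, Name_Second, Name_Third, Day_0, Day_1, Day_2. If the test data is not available at training time, the category lists could be built from df_train alone and saved. As far as I know, values outside those lists then become NaN in the Categorical and end up as all-zero rows, which may or may not be what you want.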