python machine-learning one-hot-encoding

Save and load a one-hot encoding for ML

I have been searching for two days now and it seems I cannot grasp the solution. For a machine learning regression model, I need a one-hot encoding of some columns. Data preparation and model fitting happen on my local PC. After that, the model will be uploaded to the server for predictions.

The problem is that new data was not part of the initial encoding, so I need to one-hot encode it in the same way as the training data on my PC. I found out that I can save the encoder (sklearn.preprocessing -> OneHotEncoder), but I cannot manage to get the data into the correct format.

To make this easier to understand, I created a notebook with some very simple dummy data.

# Import pandas library 
import pandas as pd 
# initialize list of lists 
data = [['tom', 10], ['nick', 15], ['juli', 14]] 
# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Name', 'Age']) 
# print dataframe.
df

Output:

| Name | Age |
|------|-----|
| tom  | 10  |
| nick | 15  |
| juli | 14  |

# hot encoding
hot_Name = pd.get_dummies(df.Name)
X = pd.concat((df[['Age']], hot_Name), axis=1)
X

Output:

| Age | juli | nick | tom |
|-----|------|------|-----|
| 10  | 0    | 0    | 1   |
| 15  | 0    | 1    | 0   |
| 14  | 1    | 0    | 0   |

# outside data
# initialize list of lists 
data_new = [['michael', 20], ['juli', 45]] 
# Create the pandas DataFrame 
df_new = pd.DataFrame(data_new, columns = ['Name', 'Age']) 
# print dataframe. 
df_new

Output:

| Name    | Age |
|---------|-----|
| michael | 20  |
| juli    | 45  |

Is it possible to encode "data_new" the same way as "data" and save the encoder for later use on live incoming data?

Expected one-hot encoding of df_new, to be used in the model:

| Age | juli | nick | tom |
|-----|------|------|-----|
| 20  | 0    | 0    | 0   |
| 45  | 1    | 0    | 0   |

Solution

To my knowledge, pandas does not expose a method to serialise encoding done with get_dummies. I'd use OneHotEncoder directly to encode the variables and then joblib to serialise it.

import joblib
import pandas as pd 
from sklearn.preprocessing import OneHotEncoder

col_names = ['name', 'age']
data = [['tom', 10], ['nick', 15], ['juli', 14]] 

enc = OneHotEncoder(handle_unknown='error')
enc.fit(data)
joblib.dump(enc, 'encoder.joblib')

Then on the server:

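# `data`, `col_names` and the imports are the same as in the snippet above
enc = joblib.load('encoder.joblib')
data_df = pd.DataFrame(data=data, columns=col_names)
# get_feature_names_out (scikit-learn >= 1.0) replaces get_feature_names, which was removed in 1.2
enc_df = pd.DataFrame(data=enc.transform(data).toarray(), columns=enc.get_feature_names_out(col_names), dtype=bool)
df = pd.concat([data_df, enc_df], axis=1)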

Output for df:

|   | name | age | name_juli | name_nick | name_tom | age_10 | age_14 | age_15 |
|---|------|-----|-----------|-----------|----------|--------|--------|--------|
| 0 | tom  | 10  | False     | False     | True     | True   | False  | False  |
| 1 | nick | 15  | False     | True      | False    | False  | False  | True   |
| 2 | juli | 14  | True      | False     | False    | False  | True   | False  |
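
Note that the snippet above encodes the age column as well, and that `handle_unknown='error'` would raise on a name such as 'michael' that was not seen during fitting. To get the layout asked for in the question, one option is to fit the encoder on the Name column only and pass `handle_unknown='ignore'`, so unseen names become all-zero rows. The following is a minimal sketch under those assumptions; the file name `name_encoder.joblib`, the temp-directory path, and the variable names `encoded` and `X_new` are made up for illustration, and the "server" part simply runs in the same script.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Training data from the question
df = pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14]], columns=['Name', 'Age'])

# Fit on the categorical column only; with handle_unknown='ignore',
# categories not seen here are encoded as all zeros at transform time
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(df[['Name']])

# Hypothetical location; in practice this is the file uploaded with the model
path = os.path.join(tempfile.gettempdir(), 'name_encoder.joblib')
joblib.dump(enc, path)

# --- later, on the server ---
enc = joblib.load(path)
df_new = pd.DataFrame([['michael', 20], ['juli', 45]], columns=['Name', 'Age'])

# Column names come from the categories learned during fit (juli, nick, tom)
encoded = pd.DataFrame(
    enc.transform(df_new[['Name']]).toarray().astype(int),
    columns=enc.categories_[0],
    index=df_new.index,
)
X_new = pd.concat([df_new[['Age']], encoded], axis=1)
print(X_new)
```

With this setup, `X_new` has the columns Age, juli, nick, tom, and 'michael' gets zeros in all three name columns, which matches the expected encoding in the question. The model should be trained on features built the same way, so that the column order is identical at training and prediction time.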