I have been searching for two days now and it seems I cannot grasp the solution. For a machine learning regression model, I need to one-hot encode some columns. The training data is prepared and the model is fitted on my local PC. Afterwards, the model is uploaded to a server for predictions.
The problem is that new data was not part of the initial encoding, so I need to one-hot encode it in the same way as the training data on my PC. I found out that I can save the encoder (sklearn.preprocessing -> OneHotEncoder), but I cannot manage to get the data into the correct format.
To make it easier to understand, I created a notebook with some very simple dummy data.
# Import pandas library
import pandas as pd
# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])
# print dataframe.
df
Output:
Name Age
tom 10
nick 15
juli 14
# one-hot encoding
hot_Name = pd.get_dummies(df.Name)
X = pd.concat((df[['Age']], hot_Name), axis=1)
X
Output:
Age juli nick tom
10 0 0 1
15 0 1 0
14 1 0 0
# outside data
# initialize list of lists
data_new = [['michael', 20], ['juli', 45]]
# Create the pandas DataFrame
df_new = pd.DataFrame(data_new, columns = ['Name', 'Age'])
# print dataframe.
df_new
Output:
Name Age
michael 20
juli 45
Is it possible to encode "data_new" the same way as "data" and save the encoder for later use on live incoming data?
Expected one-hot encoding of df_new to be used in the model:
Age juli nick tom
20 0 0 0
45 1 0 0
To my knowledge, pandas does not expose a method to serialise the encoding done with get_dummies. I'd use OneHotEncoder directly to encode the variables and then joblib to serialise it.
import joblib
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
col_names = ['name', 'age']
data = [['tom', 10], ['nick', 15], ['juli', 14]]
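# 'ignore' encodes categories unseen during fit (e.g. 'michael') as all zeros;
# handle_unknown='error' would raise a ValueError on them instead
enc = OneHotEncoder(handle_unknown='ignore')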
enc.fit(data)
joblib.dump(enc, 'encoder.joblib')
Then on the server:
enc = joblib.load('encoder.joblib')
data_df = pd.DataFrame(data=data, columns=col_names)
# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out needs 1.0 or later
enc_df = pd.DataFrame(data=enc.transform(data).toarray(), columns=enc.get_feature_names_out(col_names), dtype=bool)
df = pd.concat([data_df, enc_df], axis=1)
Output for df:
| | name | age | name_juli | name_nick | name_tom | age_10 | age_14 | age_15 |
|---|------|-----|-----------|-----------|----------|--------|--------|--------|
| 0 | tom | 10 | False | False | True | True | False | False |
| 1 | nick | 15 | False | True | False | False | False | True |
| 2 | juli | 14 | True | False | False | False | True | False |
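Note that the encoder above is fitted on both columns, so age gets one-hot encoded as well. Below is a minimal sketch of how the layout asked for in the question could be produced, assuming only the Name column should be encoded and unseen names should map to all zeros. The file name name_encoder.joblib and the helper encode are made up for illustration.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Training data from the question.
df = pd.DataFrame([['tom', 10], ['nick', 15], ['juli', 14]], columns=['Name', 'Age'])

# Fit on the categorical column only, so Age stays numeric.
# handle_unknown='ignore' turns names not seen during fit into all-zero rows.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(df[['Name']])

# Persist the fitted encoder (hypothetical path; a temp directory is used here).
encoder_path = os.path.join(tempfile.mkdtemp(), 'name_encoder.joblib')
joblib.dump(enc, encoder_path)


def encode(frame, encoder):
    """Hypothetical helper: return Age plus one 0/1 column per known name."""
    encoded = encoder.transform(frame[['Name']]).toarray().astype(int)
    hot = pd.DataFrame(encoded, columns=encoder.categories_[0], index=frame.index)
    return pd.concat([frame[['Age']], hot], axis=1)


# "Server" side: load the encoder and apply it to the new data.
loaded = joblib.load(encoder_path)
df_new = pd.DataFrame([['michael', 20], ['juli', 45]], columns=['Name', 'Age'])
X_new = encode(df_new, loaded)
print(X_new)
```

The column names are taken from enc.categories_ rather than from get_feature_names / get_feature_names_out, so the sketch does not depend on which of the two the installed scikit-learn version provides. The same encode call can be used on the training frame, which keeps the column order identical between fitting and prediction.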