I have a dataset in the form of a DataFrame, and each row has a label ranging from 1 to 5. I am one-hot encoding it with pd.get_dummies(). If my dataset contains all 5 labels there is no problem. However, not every dataset contains all 5 values, so the encoding skips the missing value and produces fewer columns, which creates a problem for new datasets coming in. Can I set a range so that the one-hot encoding knows there should be 5 labels? Or would I have to append 1, 2, 3, 4, 5 to the end of the array before encoding and then delete the last 5 rows?
Correct encode: values 1-5 are encoded
import numpy as np
import pandas as pd
arr = np.array([1,2,5,3,1,5,1,4])
df = pd.DataFrame(arr, columns = ['test'])
hotarr = np.array(pd.get_dummies(df['test']))
>>>[[1 0 0 0 0]
[0 1 0 0 0]
[0 0 0 0 1]
[0 0 1 0 0]
[1 0 0 0 0]
[0 0 0 0 1]
[1 0 0 0 0]
[0 0 0 1 0]]
Missing value encode: this dataset is missing label 4.
arr = np.array([1,2,5,3,1,5,1])
df = pd.DataFrame(arr, columns = ['test'])
hotarr = np.array(pd.get_dummies(df['test']))
>>>[[1 0 0 0]
[0 1 0 0]
[0 0 0 1]
[0 0 1 0]
[1 0 0 0]
[0 0 0 1]
[1 0 0 0]]
Set up a CategoricalDtype before encoding to ensure all categories are represented when getting dummies:
import numpy as np
import pandas as pd
arr = np.array([1, 2, 5, 3, 1, 5, 1])
df = pd.DataFrame(arr, columns=['test'])
# Cast the column to a categorical dtype that lists all five labels
df['test'] = df['test'].astype(pd.CategoricalDtype(categories=[1, 2, 3, 4, 5]))
hotarr = np.array(pd.get_dummies(df['test']))
print(hotarr)
Alternatively, you can reindex after get_dummies with fill_value=0 to add the missing columns:
hotarr = np.array(
    pd.get_dummies(df['test']).reindex(columns=[1, 2, 3, 4, 5], fill_value=0)
)
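Both produce hotarr with 5 columns even though the input does not contain 4. (On pandas 2.0 and later, get_dummies returns boolean columns by default, so pass dtype=int to get the 0/1 output shown here.)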
[[1 0 0 0 0]
[0 1 0 0 0]
[0 0 0 0 1]
[0 0 1 0 0]
[1 0 0 0 0]
[0 0 0 0 1]
[1 0 0 0 0]]
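If the same encoding has to be applied to every incoming dataset, the categorical approach could be wrapped in a small helper. The following is only a sketch: the name one_hot_fixed and its labels parameter are made up for illustration and are not part of pandas, and dtype=int is an added choice to keep integer output across pandas versions. It also checks that the reindex variant gives the same array.

```python
import numpy as np
import pandas as pd


def one_hot_fixed(values, labels=(1, 2, 3, 4, 5)):
    """One-hot encode `values` with one column per entry in `labels`.

    Hypothetical helper, not a pandas function. Values that are not in
    `labels` become NaN under the categorical dtype and therefore
    encode as an all-zero row.
    """
    cat_type = pd.CategoricalDtype(categories=list(labels))
    series = pd.Series(values).astype(cat_type)
    # dtype=int gives 0/1 instead of the boolean default in newer pandas
    return pd.get_dummies(series, dtype=int).to_numpy()


# Label 4 is absent from the input, but its column is still created
hotarr = one_hot_fixed([1, 2, 5, 3, 1, 5, 1])

# Same result via the reindex approach on the raw (non-categorical) values
raw = pd.Series([1, 2, 5, 3, 1, 5, 1])
via_reindex = (
    pd.get_dummies(raw, dtype=int)
    .reindex(columns=[1, 2, 3, 4, 5], fill_value=0)
    .to_numpy()
)

print(hotarr.shape)
print(np.array_equal(hotarr, via_reindex))
```

One difference worth keeping in mind: with the categorical dtype an unexpected label (say 6) turns into an all-zero row, while with reindex its column is dropped, which also leaves that row all zeros. Neither variant raises an error, so unexpected labels may need a separate check.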