One-hot encoding with pandas get_dummies when some label values are missing

I have a dataset in the form of a DataFrame, and each row has a label ranging from 1 to 5. I am one-hot encoding the labels with pd.get_dummies(). If my dataset contains all 5 labels, there is no problem. However, not every dataset contains all 5 labels, so the encoding simply skips the missing value and produces fewer columns, which causes problems when new datasets come in. Can I set a range so that the one-hot encoding knows there should be 5 labels? Or would I have to append 1,2,3,4,5 to the end of the array before encoding and then delete the last 5 rows?

Correct encoding: all values 1-5 are present and encoded.

import numpy as np
import pandas as pd

arr = np.array([1, 2, 5, 3, 1, 5, 1, 4])

df = pd.DataFrame(arr, columns=['test'])
hotarr = np.array(pd.get_dummies(df['test'], dtype=int))  # dtype=int: 0/1 rather than booleans

>>>[[1 0 0 0 0]
    [0 1 0 0 0]
    [0 0 0 0 1]
    [0 0 1 0 0]
    [1 0 0 0 0]
    [0 0 0 0 1]
    [1 0 0 0 0]
    [0 0 0 1 0]]

Encoding with a missing value: this dataset has no label 4, so only 4 columns are produced.

arr = np.array([1, 2, 5, 3, 1, 5, 1])

df = pd.DataFrame(arr, columns=['test'])
hotarr = np.array(pd.get_dummies(df['test'], dtype=int))

>>>[[1 0 0 0]
    [0 1 0 0]
    [0 0 0 1]
    [0 0 1 0]
    [1 0 0 0]
    [0 0 0 1]
    [1 0 0 0]]

Solution

Convert the column to a CategoricalDtype that lists all five categories before encoding. get_dummies then creates one column for every declared category, including categories that do not occur in the data:

import numpy as np
import pandas as pd

arr = np.array([1, 2, 5, 3, 1, 5, 1])

df = pd.DataFrame(arr, columns=['test'])

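# Declare all five categories up front, including the absent label 4
df['test'] = df['test'].astype(pd.CategoricalDtype(categories=[1, 2, 3, 4, 5]))
# dtype=int gives 0/1 output; newer pandas versions return booleans by default
hotarr = np.array(pd.get_dummies(df['test'], dtype=int))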

print(hotarr)

Alternatively, reindex the result of get_dummies with fill_value=0 to add the missing columns:

hotarr = np.array(pd.get_dummies(df['test'], dtype=int)
                  .reindex(columns=[1, 2, 3, 4, 5], fill_value=0))

Both approaches produce a hotarr with 5 columns even though the input does not contain the label 4:

[[1 0 0 0 0]
 [0 1 0 0 0]
 [0 0 0 0 1]
 [0 0 1 0 0]
 [1 0 0 0 0]
 [0 0 0 0 1]
 [1 0 0 0 0]]
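
Since the goal is a consistent layout for new datasets coming in, it may be convenient to wrap the categorical approach in a small helper and reuse it for every dataset. The following is only a sketch: the names LABELS and one_hot_fixed are made up for this example and are not part of pandas, and the label set 1-5 is assumed from the question.

```python
import numpy as np
import pandas as pd

LABELS = [1, 2, 3, 4, 5]  # assumed full label set, taken from the question


def one_hot_fixed(values, categories=LABELS):
    """One-hot encode `values` with one column per entry in `categories`."""
    # Declaring the categories up front keeps the column layout stable,
    # regardless of which labels happen to appear in `values`.
    cat = pd.Series(pd.Categorical(values, categories=categories))
    # dtype=int requests 0/1 output instead of booleans.
    return pd.get_dummies(cat, dtype=int).to_numpy()


full = one_hot_fixed([1, 2, 5, 3, 1, 5, 1, 4])
partial = one_hot_fixed([1, 2, 5, 3, 1, 5, 1])  # label 4 never appears

print(full.shape)            # (8, 5)
print(partial.shape)         # (7, 5)
print(partial[:, 3].sum())   # 0, the column for label 4 exists but is all zeros
```

One thing to be aware of: a value outside the declared categories (a 6, say) becomes NaN when converted to the categorical, so it is encoded as a row of all zeros rather than raising an error. If that should never happen in your data, it is probably worth validating the labels before encoding.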