Tags: python, pandas, numpy, one-hot-encoding

One-hot encode with pandas get_dummies when some labels are absent


I have a dataset in the form of a DataFrame, and each row has a label ranging from 1 to 5. I am one-hot encoding the labels with pd.get_dummies(). If a dataset contains all 5 labels there is no problem. However, not all datasets contain all 5 labels, so the encoding skips the missing label and produces fewer columns, which creates a problem when new datasets come in. Can I set a range so that the encoder knows there should be 5 labels? Or would I have to append 1, 2, 3, 4, 5 to the end of the array before encoding and then delete the last 5 rows?

Correct encode: values 1-5 are encoded

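import numpy as np
import pandas as pd

arr = np.array([1, 2, 5, 3, 1, 5, 1, 4])
df = pd.DataFrame(arr, columns=['test'])
# dtype=int gives 0/1 output (pandas 2.0+ defaults to bool)
hotarr = np.array(pd.get_dummies(df['test'], dtype=int))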

>>>[[1 0 0 0 0]
    [0 1 0 0 0]
    [0 0 0 0 1]
    [0 0 1 0 0]
    [1 0 0 0 0]
    [0 0 0 0 1]
    [1 0 0 0 0]
    [0 0 0 1 0]]

Missing value encode: this dataset is missing label 4.

arr = np.array([1, 2, 5, 3, 1, 5, 1])

df = pd.DataFrame(arr, columns=['test'])
# dtype=int gives 0/1 output (pandas 2.0+ defaults to bool)
hotarr = np.array(pd.get_dummies(df['test'], dtype=int))

>>>[[1 0 0 0]
    [0 1 0 0]
    [0 0 0 1]
    [0 0 1 0]
    [1 0 0 0]
    [0 0 0 1]
    [1 0 0 0]]

Solution

  • Set up the CategoricalDtype before encoding to ensure all categories are represented when getting dummies:

    import numpy as np
    import pandas as pd
    
    arr = np.array([1, 2, 5, 3, 1, 5, 1])
    
    df = pd.DataFrame(arr, columns=['test'])
    
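    # Declare all 5 categories up front so get_dummies emits a column for each
    df['test'] = df['test'].astype(pd.CategoricalDtype(categories=[1, 2, 3, 4, 5]))
    # dtype=int gives 0/1 output (pandas 2.0+ defaults to bool)
    hotarr = np.array(pd.get_dummies(df['test'], dtype=int))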
    
    print(hotarr)
    

    Alternatively, reindex the result of get_dummies with fill_value=0 to add the missing columns:

    hotarr = np.array(pd.get_dummies(df['test'], dtype=int)
                      .reindex(columns=[1, 2, 3, 4, 5], fill_value=0))
    

    Both approaches produce a hotarr with 5 columns even though the input does not contain the label 4:

    [[1 0 0 0 0]
     [0 1 0 0 0]
     [0 0 0 0 1]
     [0 0 1 0 0]
     [1 0 0 0 0]
     [0 0 0 0 1]
     [1 0 0 0 0]]
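
    Since the concern is new datasets arriving later, it may help to wrap the CategoricalDtype approach in a small function so every dataset is encoded against the same label set. The following is only a sketch: one_hot_fixed and its parameter names are hypothetical and not part of pandas, and the label set 1-5 is taken from the question.

```python
import numpy as np
import pandas as pd


def one_hot_fixed(values, categories):
    """Hypothetical helper: one-hot encode `values` with one column per entry
    in `categories`, whether or not each category occurs in `values`."""
    s = pd.Series(values).astype(pd.CategoricalDtype(categories=categories))
    # dtype=int gives 0/1 output (pandas 2.0+ defaults to bool)
    return pd.get_dummies(s, dtype=int).to_numpy()


labels = [1, 2, 3, 4, 5]  # assumed fixed label set from the question

# First dataset contains every label; second one is missing label 4.
train = one_hot_fixed(np.array([1, 2, 5, 3, 1, 5, 1, 4]), labels)
new = one_hot_fixed(np.array([1, 2, 5, 3, 1, 5, 1]), labels)

print(train.shape)  # (8, 5)
print(new.shape)    # (7, 5)
print(new[:, 3])    # column for label 4 is all zeros

# A value outside `labels` becomes NaN in the categorical Series,
# so its row is encoded as all zeros rather than raising an error.
unknown = one_hot_fixed(np.array([6]), labels)
print(unknown)      # [[0 0 0 0 0]]
```

    Note that a value outside the declared categories is silently encoded as an all-zero row with this approach, so you may want to validate the labels first if such values indicate bad data.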