Tags: python, pandas, dummy-data

Pandas: map values of categorical variable to a predefined list of dummy columns


I have a categorical variable with known levels (e.g. hour, which only takes values from 0 to 23), but not all levels are present in the data yet: say we have measurements for hours 0 through 11, while hours 12 through 23 are not covered and will only be added later. If we naively use pandas.get_dummies() to map values to indicator variables, we end up with only 12 columns instead of 24. Is there a way to map the values of a categorical variable to a predefined list of dummy variables?

Here's an example of expected behaviour:

possible_values = range(24)
hours = get_dummies_on_steroids(df['hour'], prefix='hour', levels=possible_values)

Solution

  • Use the new and improved Categorical type in pandas 0.15 and declare all 24 levels up front:

    import pandas as pd
    import numpy as np
    df = pd.DataFrame({'hour': [0, 1, 3, 8, 13, 14], 'val': np.random.randn(6)})
    df
    Out[4]: 
       hour       val
    0     0 -0.098287
    1     1 -0.682777
    2     3  1.000749
    3     8 -0.558877
    4    13  1.423675
    5    14  1.461552
    
    df['hour_cat'] = pd.Categorical(df['hour'], categories=range(24))
    pd.get_dummies(df['hour_cat'])
    Out[6]: 
       0   1   2   3   4   5   6   7   8   9  ...  
    0   1   0   0   0   0   0   0   0   0   0 ...      
    1   0   1   0   0   0   0   0   0   0   0 ...   
    2   0   0   0   1   0   0   0   0   0   0 ...   
    3   0   0   0   0   0   0   0   0   1   0 ...   
    4   0   0   0   0   0   0   0   0   0   0 ...   
    5   0   0   0   0   0   0   0   0   0   0 ...
    

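    The situation you describe, where you know your data can take a specific set of values but you haven't necessarily observed all of them, is exactly what Categorical is good for. The declared categories are stored separately from the observed values, so get_dummies produces a column for every category, including those that never appear in the data.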
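
The approach above could be wrapped into a helper with the signature the question asks for. This is only a minimal sketch: the name get_dummies_with_levels and its parameters are hypothetical, not part of pandas, and it assumes a more recent pandas than 0.15 (pd.CategoricalDtype and the dtype argument of get_dummies were added in later releases). Passing dtype=int keeps the 0/1 output shown above, since recent pandas versions return boolean columns by default.

```python
import pandas as pd


def get_dummies_with_levels(series, levels, prefix=None):
    """Return one indicator column per entry in `levels`, observed or not.

    Hypothetical helper, not a pandas function.
    """
    # Declaring every level up front makes get_dummies emit a column
    # even for levels that never occur in the data.
    dtype = pd.CategoricalDtype(categories=list(levels))
    # Values outside `levels` become NaN and yield an all-zero row.
    return pd.get_dummies(series.astype(dtype), prefix=prefix, dtype=int)


df = pd.DataFrame({"hour": [0, 1, 3, 8, 13, 14]})
hours = get_dummies_with_levels(df["hour"], range(24), prefix="hour")
print(hours.shape)
```

The result has 24 columns named hour_0 through hour_23 regardless of which hours were observed, so frames built from different batches of data share the same layout.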