I have a categorical variable with known levels (e.g. hour, which only takes values between 0 and 23), but not all of the levels are present in the data right now (say, we have measurements for hours 0 to 11, while hours 12 to 23 are not covered yet), though the other values are going to be added later. If we naively use pandas.get_dummies() to map values to indicator variables, we will end up with only 12 of them instead of 24. Is there a way to map values of the categorical variable to a predefined list of dummy variables?
Here's an example of expected behaviour:
possible_values = range(24)
hours = get_dummies_on_steroids(df['hour'], prefix='hour', levels=possible_values)
Using the new and improved Categorical type in pandas 0.15:
import pandas as pd
import numpy as np
df = pd.DataFrame({'hour': [0, 1, 3, 8, 13, 14], 'val': np.random.randn(6)})
df
Out[4]:
   hour       val
0     0 -0.098287
1     1 -0.682777
2     3  1.000749
3     8 -0.558877
4    13  1.423675
5    14  1.461552
df['hour_cat'] = pd.Categorical(df['hour'], categories=range(24))
pd.get_dummies(df['hour_cat'])
Out[6]:
   0  1  2  3  4  5  6  7  8  9  ...
0  1  0  0  0  0  0  0  0  0  0  ...
1  0  1  0  0  0  0  0  0  0  0  ...
2  0  0  0  1  0  0  0  0  0  0  ...
3  0  0  0  0  0  0  0  0  1  0  ...
4  0  0  0  0  0  0  0  0  0  0  ...
5  0  0  0  0  0  0  0  0  0  0  ...
The situation you describe, where you know your data can take a specific set of values but you haven't necessarily observed all of them, is exactly what Categorical is good for.
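If you want something closer to the interface sketched in the question, the same idea can be wrapped in a small helper. This is only a rough sketch: the name get_dummies_with_levels and its levels argument are made up for illustration and are not part of pandas. It relies only on pd.Categorical and the prefix argument of pd.get_dummies.

```python
import pandas as pd


def get_dummies_with_levels(values, levels, prefix=None):
    """Hypothetical helper: one indicator column per entry in `levels`."""
    values = pd.Series(values)
    # Declaring every category up front makes get_dummies emit a column
    # for each level, including levels that never occur in the data.
    # Values that are not listed in `levels` become NaN and therefore
    # produce a row with no indicator set.
    cat = pd.Series(
        pd.Categorical(values, categories=list(levels)),
        index=values.index,
    )
    return pd.get_dummies(cat, prefix=prefix)


df = pd.DataFrame({'hour': [0, 1, 3, 8, 13, 14]})
hours = get_dummies_with_levels(df['hour'], levels=range(24), prefix='hour')
print(hours.shape)  # (6, 24)
```

The resulting columns are named hour_0 through hour_23, so the layout stays the same when data for the remaining hours arrives later.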