python, pandas, machine-learning, dummy-variable

How to get pandas get_dummies to emit N-1 variables to avoid collinearity?


pandas.get_dummies emits one dummy variable per categorical value. Is there an automated, easy way to ask it to create only N-1 dummy variables, i.e. to arbitrarily drop one "baseline" variable?

This is needed to avoid collinearity in our dataset.


Solution

  • Pandas version 0.18.0 implemented exactly what you're looking for: the drop_first option. Here's an example:

    In [1]: import pandas as pd
    
    In [2]: pd.__version__
    Out[2]: u'0.18.1'
    
    In [3]: s = pd.Series(list('abcbacb'))
    
    In [4]: pd.get_dummies(s, drop_first=True)
    Out[4]: 
         b    c
    0  0.0  0.0
    1  1.0  0.0
    2  0.0  1.0
    3  1.0  0.0
    4  0.0  0.0
    5  0.0  1.0
    6  1.0  0.0
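
The same option also works on a whole DataFrame. Here is a minimal sketch, assuming a made-up frame with a `color` column (the data and column names are hypothetical, not from the question) and a pandas version recent enough to accept the `dtype` argument of `get_dummies`. It shows why the full encoding is collinear with an intercept (each row's dummies sum to 1) and that `drop_first=True` drops the first category in sorted order, which then acts as the baseline:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": [1, 2, 3, 2],
})

# Full encoding: one column per category. Every row has exactly one 1
# among the color_* columns, so they sum to 1 and duplicate an intercept.
full = pd.get_dummies(df, columns=["color"], dtype=int)
row_sums = full.filter(like="color_").sum(axis=1)

# N-1 encoding: the first category in sorted order ("blue") is dropped
# and becomes the implicit baseline (all remaining dummies are 0).
reduced = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(row_sums.tolist())
print(reduced.columns.tolist())
```

Note that the dropped level is simply the first one after sorting. If you want a specific baseline, one option is to build the full encoding and drop that column yourself.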