removing redundant columns when using get_dummies


I have a pandas DataFrame df containing categorical variables.

import pandas

df = pandas.DataFrame(data=[['male', 'blue'], ['female', 'brown'],
                            ['male', 'black']], columns=['gender', 'eyes'])

df
Out[16]: 
   gender   eyes
0    male   blue
1  female  brown
2    male  black

Using the function get_dummies, I get the following DataFrame:

df_dummies = pandas.get_dummies(df)

df_dummies
Out[18]: 
   gender_female  gender_male  eyes_black  eyes_blue  eyes_brown
0              0            1           0          1           0
1              1            0           0          0           1
2              0            1           1          0           0

However, the columns gender_female and gender_male contain the same information, because the original column takes only two values. Is there a (smart) way to keep only one of the two columns?

UPDATED

The use of

df_dummies = pandas.get_dummies(df,drop_first=True)

would give me

df_dummies
Out[21]: 
   gender_male  eyes_blue  eyes_brown
0            1          1           0
1            0          0           1
2            1          0           0

but this also drops eyes_black. I would like to drop a column only for variables that originally had just 2 possible values.

The desired result should be

df_dummies
Out[18]: 
   gender_male  eyes_black  eyes_blue  eyes_brown
0            1           0          1           0
1            0           0          0           1
2            1           1          0           0

Solution

  • Yes, you can use the argument drop_first:

    drop_first=True
    

    From the documentation:

    pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
       b  c
    0  0  0
    1  1  0
    2  0  1
    3  0  0
    4  0  0
    

    To keep all dummy columns for eyes but only one for gender, encode eyes on its own first, then encode the remaining column (gender) with drop_first=True:

    df = pd.get_dummies(df, prefix=['eyes'], columns=['eyes'])
    df = pd.get_dummies(df,drop_first=True)
    

    Output:

       eyes_black  eyes_blue  eyes_brown  gender_male
    0           0          1           0            1
    1           0          0           1            0
    2           1          0           0            1
    

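    More generally, given a DataFrame with several categorical columns:

       gender   eyes    heigh
    0    male   blue     tall
    1  female  brown    short
    2    male  black  average

    fully encode every column that has more than two distinct values, then encode the remaining columns with drop_first=True:

    for col in df.columns:
        # Columns with more than two distinct values keep all of their dummies
        if df[col].nunique() > 2:
            df = pd.get_dummies(df, prefix=[col], columns=[col])
    # Encode the remaining columns, dropping the first level of each
    df = pd.get_dummies(df, drop_first=True)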
    

    Output:

       eyes_black  eyes_blue  eyes_brown  heigh_average  heigh_short  heigh_tall  gender_male
    0           0          1           0              0            0           1            1
    1           0          0           1              0            1           0            0
    2           1          0           0              1            0           0            1
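
The same idea can be wrapped in a small function. This is only a sketch under a few assumptions: the name encode_with_binary_drop and its columns argument are made up for illustration (they are not part of pandas), and the caller lists the categorical columns explicitly. It passes dtype=int because newer pandas versions return boolean dummy columns by default, whereas the outputs above show 0/1.

```python
import pandas as pd


def encode_with_binary_drop(df, columns):
    """Dummy-encode `columns`, keeping one dummy for two-valued variables.

    Hypothetical helper for illustration; not a pandas function.
    """
    # Variables with exactly two distinct values need only one dummy column.
    binary = [c for c in columns if df[c].nunique() == 2]
    # Every other listed column keeps all of its dummy columns.
    multi = [c for c in columns if c not in binary]

    out = df
    if multi:
        out = pd.get_dummies(out, columns=multi, dtype=int)
    if binary:
        out = pd.get_dummies(out, columns=binary, drop_first=True, dtype=int)
    return out


df = pd.DataFrame(
    data=[["male", "blue"], ["female", "brown"], ["male", "black"]],
    columns=["gender", "eyes"],
)
result = encode_with_binary_drop(df, ["gender", "eyes"])
print(result)
```

On the question's data this should leave eyes_black, eyes_blue, eyes_brown and a single gender_male column. Splitting the columns up front avoids reassigning df while looping over its columns, and the two if guards mean get_dummies is never called with an empty column list.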