python string pandas categorical-data dummy-variable

Generate many dummies in pandas when every observation contains a list of possible values


I have a dataframe with one column looking like:

col
A B C
B C X
U

I would like to generate some dummy variables that tell me if a row contains a specific value. That is, in the example, I would like to generate 5 dummy variables (d_A, d_B, d_C, d_X, d_U) so that the data will look like

col      d_A      d_B      d_C      d_X      d_U
A B C    1        1        1        0        0
B C X    0        1        1        1        0
...

I have many, many possible values so I cannot do this easily by hand. Any idea how to do that in pandas (in a vectorized mode)?

Thanks!


Solution

  • Use str.get_dummies and join or concat:

    print(df.col.str.get_dummies(sep=' '))
       A  B  C  U  X
    0  1  1  1  0  0
    1  0  1  1  0  1
    2  0  0  0  1  0
    
    print(df.join(df.col.str.get_dummies(sep=' ')))
         col  A  B  C  U  X
    0  A B C  1  1  1  0  0
    1  B C X  0  1  1  0  1
    2      U  0  0  0  1  0
    

    If you need to change the column names, use a list comprehension:

    df1 = df.col.str.get_dummies(sep=' ')
    df1.columns = ['d_' + x for x in df1.columns]
    print(df1)
       d_A  d_B  d_C  d_U  d_X
    0    1    1    1    0    0
    1    0    1    1    0    1
    2    0    0    0    1    0
    
    print(df.join(df1))
         col  d_A  d_B  d_C  d_U  d_X
    0  A B C    1    1    1    0    0
    1  B C X    0    1    1    0    1
    2      U    0    0    0    1    0
    
    print(pd.concat([df, df1], axis=1))
         col  d_A  d_B  d_C  d_U  d_X
    0  A B C    1    1    1    0    0
    1  B C X    0    1    1    0    1
    2      U    0    0    0    1    0
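
For reference, the steps above can be put together in one self-contained script. This is only a sketch: it assumes the values in col are separated by single spaces, rebuilds the sample frame from the question, and uses DataFrame.add_prefix in place of the list comprehension to rename the columns (a substitution, not part of the answer above). The names dummies and result are illustrative.

```python
import pandas as pd

# Sample data from the question: each row holds space-separated values.
df = pd.DataFrame({'col': ['A B C', 'B C X', 'U']})

# One indicator column per distinct token found in the column.
# add_prefix renames the columns to d_A, d_B, ... in a single step.
dummies = df['col'].str.get_dummies(sep=' ').add_prefix('d_')

# Attach the indicator columns to the original frame, aligned on the index.
result = df.join(dummies)
print(result)
```

The indicator columns come out in sorted order (d_A, d_B, d_C, d_U, d_X), not in order of first appearance, so reorder them explicitly if the layout matters.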