Custom Dummy Coding in Pandas


I have a dataframe of event data with two columns, primary and secondary, each holding a list of tags (e.g., ['Fun event', 'Dance party']). A third column, combined, holds the two lists concatenated.

primary               secondary               combined
['booze', 'party']    ['singing', 'dance']    ['booze', 'party', 'singing', 'dance']
['concert']           ['booze', 'vocals']     ['concert', 'booze', 'vocals']

I want to dummy-code the data so that tags in the primary column are coded 1, tags in the secondary column are coded 0.5, and tags not observed in the row are coded 0, like so:

combined                                  booze  party  singing  dance  concert  vocals
['booze', 'party', 'singing', 'dance']    1      1      0.5      0.5    0        0
['concert', 'booze', 'vocals']            0.5    0      0        0      1        0.5

Solution

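  • import pandas as pd

    # df is the input frame shown further down.
    # One indicator column per tag, summed back to one row per event.
    df1 = pd.get_dummies(df['combined'].explode(), dtype=float).groupby(level=0).sum()
    # Same indicators for the secondary tags, selected by column name.
    sec = pd.get_dummies(df['secondary'].explode(), dtype=float).groupby(level=0).sum()
    # Secondary tags drop from 1.0 to 0.5; tags absent from secondary stay as they are.
    df1 = df1.sub(0.5 * sec, fill_value=0)

    df1
    Out[173]: 
       booze  concert  dance  party  singing  vocals
    0    1.0      0.0    0.5    1.0      0.5     0.0
    1    0.5      1.0    0.0    0.0      0.0     0.5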
    

    Data input:

    df = pd.DataFrame({'primary':   [['booze', 'party'], ['concert']],
                       'secondary': [['singing', 'dance'], ['booze', 'vocals']],
                       'combined':  [['booze', 'party', 'singing', 'dance'], ['concert', 'booze', 'vocals']]})
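
If the combined column isn't available, or a tag might show up in both lists of the same row, one possible variant is to build the codes straight from primary and secondary. This is only a sketch: the helper name weighted_dummies and its weights argument are made up for illustration, and keeping the larger weight when a tag appears in both lists is an assumption, since the question doesn't say what should happen in that case.

```python
import pandas as pd

df = pd.DataFrame({
    'primary':   [['booze', 'party'], ['concert']],
    'secondary': [['singing', 'dance'], ['booze', 'vocals']],
})


def weighted_dummies(frame, weights):
    """Hypothetical helper: one column per tag, coded with the weight of its source column.

    `weights` maps a list-valued column name to the code its tags receive.
    Assumption: a tag found in several columns of one row keeps the largest weight.
    """
    rows = []
    for _, row in frame.iterrows():
        coded = {}
        for col, weight in weights.items():
            for tag in row[col]:
                coded[tag] = max(coded.get(tag, 0.0), weight)
        rows.append(coded)
    # Tags missing from a row come out as NaN, so fill them with 0;
    # columns are sorted alphabetically for a stable layout.
    return pd.DataFrame(rows, index=frame.index).fillna(0.0).sort_index(axis=1)


result = weighted_dummies(df, {'primary': 1.0, 'secondary': 0.5})
print(result)
```

Passing the weights as a dict keeps the 1 / 0.5 coding in one place, so a third tag column or a different secondary weight only needs a change to that argument.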