Search code examples
pythonpandasdummy-variable

batch-populate dummy variable columns in pandas


I have created a unique member-level dataframe that looks like this:

memberid      codes       cost
memberA       {c1, c2}    100.0
memberB       {c2, c3}    120.0
memberC       {c1, c5}    200.0

The values of codes are sets. I am trying to get some correlations between individual codes and cost, in order to eventually build a predictive model using some other features I have. Outside Pandas I have created a Counter() that sorted the codes by prevalence:

all_codes = []
for entry in df['codes'].values:
    for code in entry:
        all_codes.append(code)

common_code_info = Counter(all_codes).most_common()
common_codes = [el[0] for el in common_code_info]

There is a total of ~500 codes in common_codes, and I would like to create the same amount of dummy variables, but I am not sure how can I do it in Pandas. I tried something like:

for code in common_codes:
    if code in df['codes'].values:
        df['has_'+code] = 1
    else:
        df['has_'+code] = 0

but it didn't work (all dummy columns are zeroes). Is there a simple way to populate these dummy columns, since get_dummies can not be used because the "source" of the potential dummy variables lives outside the dataframe? Unless there is a simpler way to do everything I am looking for just using Pandas.

EDIT: The codes are diagnostic codes, so they have values like C801, R911, etc. The resulting dataframe should look like:

memberid      codes       cost    has_c1    has_c2    has_c3
memberA       {c1, c2}    100.0   1         1         0
memberB       {c2, c3}    120.0   0         1         1
memberC       {c1, c5}    200.0   1         0         0

I have also tried:

for code in common_codes:
    df['has_'+code] = np.where(code in df['codes'], 1, 0)

but this didn't work either.

Code for the sample dataframe:

data = {'memberid': ['memberA', 'memberB', 'memberC'], 
        'codes': [{c1, c2}, {c2, c3}, {c1, c5}], 
        'cost': [100.0, 120.0, 200.0]}
df = pd.DataFrame(data, columns = ['memberid', 'codes', 'cost'])

Solution

  • One way is to use pd.get_dummies.

    The slight complication is you need to convert set to list first as get_dummies requires an ordered collection.

    import pandas as pd
    
    data = {'memberid': ['memberA', 'memberB', 'memberC'], 
            'codes': [{'c1', 'c2'}, {'c2', 'c3'}, {'c1', 'c5'}], 
            'cost': [100.0, 120.0, 200.0]}
    
    df = pd.DataFrame(data, columns = ['memberid', 'codes', 'cost'])
    
    dummies = pd.get_dummies(df['codes'].apply(list).apply(pd.Series).stack()).sum(level=0)
    
    res = df.join(dummies)
    
    print(res)
    
      memberid     codes   cost  c1  c2  c3  c5
    0  memberA  {c2, c1}  100.0   1   1   0   0
    1  memberB  {c2, c3}  120.0   0   1   1   0
    2  memberC  {c5, c1}  200.0   1   0   0   1