I have created a member-level dataframe (one row per unique member) that looks like this:
memberid codes cost
memberA {c1, c2} 100.0
memberB {c2, c3} 120.0
memberC {c1, c5} 200.0
The values of codes are sets. I am trying to get some correlations between individual codes and cost, in order to eventually build a predictive model using some other features I have. Outside of Pandas I have created a Counter() that sorts the codes by prevalence:
from collections import Counter

all_codes = []
for entry in df['codes'].values:
    for code in entry:
        all_codes.append(code)

common_code_info = Counter(all_codes).most_common()
common_codes = [el[0] for el in common_code_info]
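As an aside, the same flattening can be written more compactly with itertools.chain; a rough equivalent of the loop above:

from collections import Counter
from itertools import chain

# Flatten every member's code set into one stream and count occurrences
common_code_info = Counter(chain.from_iterable(df['codes'])).most_common()
common_codes = [code for code, _ in common_code_info]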
There are ~500 codes in total in common_codes, and I would like to create the same number of dummy variables, but I am not sure how to do that in Pandas. I tried something like:
for code in common_codes:
    if code in df['codes'].values:
        df['has_' + code] = 1
    else:
        df['has_' + code] = 0
but it didn't work (all the dummy columns come out as zeroes). Is there a simple way to populate these dummy columns, given that get_dummies cannot be used because the "source" of the potential dummy variables lives outside the dataframe? Unless, of course, there is a simpler way to do everything I am looking for using just Pandas.
EDIT:
The codes are diagnostic codes, so they have values like C801, R911, etc. The resulting dataframe should look like:
memberid codes cost has_c1 has_c2 has_c3
memberA {c1, c2} 100.0 1 1 0
memberB {c2, c3} 120.0 0 1 1
memberC {c1, c5} 200.0 1 0 0
I have also tried:
for code in common_codes:
    df['has_' + code] = np.where(code in df['codes'], 1, 0)
but this didn't work either.
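For reference, both attempts most likely fail for the same reason: code in df['codes'] tests membership in the Series index, and code in df['codes'].values compares the code against whole sets, so the condition is never true and every column is filled with zeroes. A per-row membership test needs apply; a minimal sketch along those lines, assuming common_codes from above:

# Test each member's individual set rather than the whole column at once
for code in common_codes:
    df['has_' + code] = df['codes'].apply(lambda s: int(code in s))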
Code for the sample dataframe:
import pandas as pd

data = {'memberid': ['memberA', 'memberB', 'memberC'],
        'codes': [{'c1', 'c2'}, {'c2', 'c3'}, {'c1', 'c5'}],
        'cost': [100.0, 120.0, 200.0]}
df = pd.DataFrame(data, columns=['memberid', 'codes', 'cost'])
One way is to use pd.get_dummies. The slight complication is that you need to convert each set to a list first, since get_dummies requires an ordered collection.
import pandas as pd

data = {'memberid': ['memberA', 'memberB', 'memberC'],
        'codes': [{'c1', 'c2'}, {'c2', 'c3'}, {'c1', 'c5'}],
        'cost': [100.0, 120.0, 200.0]}
df = pd.DataFrame(data, columns=['memberid', 'codes', 'cost'])

# Explode each set into one row per code, one-hot encode, then collapse back to
# one row per member (groupby(level=0).sum() is the current spelling of the
# older sum(level=0), which is no longer supported in recent pandas versions)
dummies = pd.get_dummies(df['codes'].apply(list).apply(pd.Series).stack()).groupby(level=0).sum()
res = df.join(dummies)
print(res)
memberid codes cost c1 c2 c3 c5
0 memberA {c2, c1} 100.0 1 1 0 0
1 memberB {c2, c3} 120.0 0 1 1 0
2 memberC {c5, c1} 200.0 1 0 0 1
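If you need the has_ prefix and only the ~500 codes collected in common_codes (the result above keeps every code that appears), the dummies frame can be reindexed and renamed before the join; a sketch assuming common_codes is defined as in the question:

# Keep only the common codes (filling any missing ones with 0) and add the has_ prefix
dummies = dummies.reindex(columns=common_codes, fill_value=0).add_prefix('has_')
res = df.join(dummies)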