I have the following pandas DataFrame:
d = {"account_number": [1234, 5678, 9012, 1234, 5678, 9012, 1234, 5678, 9012, 1234, 5678, 9012],
     "c1": ["V89", "M54", "T14", "V89", "M54", "S13", "V89", "M62", "T14", "V43", "M54", None],
     "c2": ["V43", "M54", None, "T14", "M54", None, "G89", None, "T14", "S59", None, None],
     "c3": [None, None, None, "LML", None, None, "G89", None, None, "V29", None, None]}
df = pd.DataFrame(data=d)
that contains three columns of 3-character codes: c1, c2, and c3. Over 11,000 distinct codes can appear in any one of these columns, but I only care about 15 of them, so pd.get_dummies() does not meet my needs.
I set up 15 dummy/indicator columns based on the first column of codes:
codes = ["V89", "M54", "V49", "R07", "V43", "S16", "R51", "S52", "S59", "M99", "T14", "S13", "S61", "V29", "V87"]
df[codes] = df["c1"].apply(lambda x: pd.Series([int(code == x.upper()) if x is not None else 0 for code in codes]))
Next, I want to set the indicator/dummy columns for the values in columns c2 and c3:
code_columns = ["c2", "c3"]
for col in code_columns:
    for index, row in df.iterrows():
        code = row[col]
        if code in codes:
            df.loc[index, code] = 1
And finally, I want to group the DataFrame by account_number, with the indicator/dummy columns showing whether any of the 15 codes ever appears in any record for that account (and there can be hundreds or thousands of records per account_number):
agg_dict = {key: "max" for key in df.columns if key in codes}
df = df.groupby("account_number").agg(agg_dict).reset_index()
While I got the desired end result, I have to believe there is a more elegant and concise solution in Python, especially considering I had to use df.iterrows() to set the indicator/dummy columns for the values in the other columns. I'm afraid this will become a performance and/or memory issue when I run this solution on a dataset with five code columns and millions of records.
So, that's really my question to all you Python experts: is there a better way to solve this problem than the solution I came up with?
You can keep only the values that are isin your 15 codes with where, stack the result to remove the NaN, and use str.get_dummies to get a column per possible code. Then groupby.max on the account_number, which you set as the index beforehand with set_index. Finally, reindex to get all the possible codes.
res = (
    df.set_index('account_number')[['c1', 'c2', 'c3']]
      .where(lambda x: x.isin(codes))
      .stack()
      .str.get_dummies()
      .groupby(level='account_number').max()
      .reindex(columns=codes, fill_value=0)
)
print(res)
# V89 M54 V49 R07 V43 S16 R51 S52 S59 M99 T14 S13 \
# account_number
# 1234 1 0 0 0 1 0 0 0 1 0 1 0
# 5678 0 1 0 0 0 0 0 0 0 0 0 0
# 9012 0 0 0 0 0 0 0 0 0 0 1 1
# S61 V29 V87
# account_number
# 1234 0 1 0
# 5678 0 0 0
# 9012 0 0 0
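For comparison, here is a sketch of another vectorized route (not from the answer above, and not necessarily faster): melt the code columns into long format, filter with isin, and build the indicator table with pd.crosstab, clipping the occurrence counts down to 0/1:

```python
import pandas as pd

d = {"account_number": [1234, 5678, 9012, 1234, 5678, 9012, 1234, 5678, 9012, 1234, 5678, 9012],
     "c1": ["V89", "M54", "T14", "V89", "M54", "S13", "V89", "M62", "T14", "V43", "M54", None],
     "c2": ["V43", "M54", None, "T14", "M54", None, "G89", None, "T14", "S59", None, None],
     "c3": [None, None, None, "LML", None, None, "G89", None, None, "V29", None, None]}
df = pd.DataFrame(data=d)
codes = ["V89", "M54", "V49", "R07", "V43", "S16", "R51", "S52", "S59", "M99", "T14", "S13", "S61", "V29", "V87"]

# Long format: one row per (account_number, code occurrence).
long = df.melt(id_vars="account_number", value_vars=["c1", "c2", "c3"], value_name="code")
long = long[long["code"].isin(codes)]

# crosstab counts occurrences per account; clip(upper=1) turns counts
# into 0/1 indicators, and reindex adds the codes that never occur.
res = (pd.crosstab(long["account_number"], long["code"])
         .clip(upper=1)
         .reindex(columns=codes, fill_value=0))
```

Both approaches collapse to one indicator row per account; the crosstab version also scales to any number of code columns by extending value_vars.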