
How to set DataFrame dummy/indicator columns for a small subset of codes in multiple categorical columns


I have the following pandas DataFrame:


    import pandas as pd

    d = {"account_number": [1234, 5678, 9012, 1234, 5678, 9012, 1234, 5678, 9012, 1234, 5678, 9012],
         "c1": ["V89", "M54", "T14", "V89", "M54", "S13", "V89", "M62", "T14", "V43", "M54", None],
         "c2": ["V43", "M54", None, "T14", "M54", None, "G89", None, "T14", "S59", None, None],
         "c3": [None, None, None, "LML", None, None, "G89", None, None, "V29", None, None]}
    df = pd.DataFrame(data=d)

that contains three columns of 3-character codes: c1, c2, and c3. Any one of over 11,000 codes can appear in these columns, but I only care about 15 of them, so pd.get_dummies() does not meet my needs.
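To illustrate the problem (a minimal, hypothetical example, not part of my real data): pd.get_dummies() emits one indicator column for every distinct value it sees, so on the full dataset it would produce thousands of columns instead of just the 15 I need.

```python
import pandas as pd

# Hypothetical miniature: get_dummies creates a column for EVERY
# distinct code, including ones we don't care about ("XYZ", "ABC").
s = pd.Series(["V89", "M54", "XYZ", "ABC"])
dummies = pd.get_dummies(s)
print(list(dummies.columns))  # one column per distinct code
```
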

I set up 15 dummy/indicator columns based on the first column of codes:


codes = ["V89", "M54", "V49", "R07", "V43", "S16", "R51", "S52", "S59", "M99", "T14", "S13", "S61", "V29", "V87"]
df[codes] = df["c1"].apply(
    lambda x: pd.Series([int(code == x.upper()) if x is not None else 0 for code in codes])
)

Next, I want to set the indicator/dummy columns for the values in columns c2 and c3:


code_columns = ["c2", "c3"]
for col in code_columns:
    for index, row in df.iterrows():
        code = row[col]
        if code in codes:
            df.loc[index, code] = 1

And finally, I want to group the DataFrame by account_number, and have the indicator/dummy columns indicate if any of the 15 codes ever appear for any record corresponding to that account (and there can be hundreds or thousands of records per account_number):


agg_dict = {key: "max" for key in df.columns if key in codes}
df = df.groupby("account_number").agg(agg_dict).reset_index()

While I got the desired end result, I have to believe there is a much more elegant and brief solution in Python, especially considering I had to use df.iterrows() when setting the indicator/dummy columns for the values in other columns. I'm afraid this will become a performance and/or memory issue when I attempt to run this solution on a dataset with five code columns and millions of records.

So, that's really my question to all you Python experts: is there a better way to solve this problem than the solution I came up with?


Solution

  • You can keep only the values that are in your 15 codes with where + isin, stack the result to drop the NaNs, and use str.get_dummies to get one column per code present. Then groupby.max on account_number, which was set as the index beforehand with set_index. Finally, reindex to get a column for every possible code.

    res = (
        df.set_index('account_number')[['c1','c2','c3']]
          .where(lambda x: x.isin(codes))
          .stack()
          .str.get_dummies()
          .groupby(level='account_number').max()
          .reindex(columns=codes, fill_value=0)
    )
    print(res)
    #                 V89  M54  V49  R07  V43  S16  R51  S52  S59  M99  T14  S13  \
    # account_number                                                               
    # 1234              1    0    0    0    1    0    0    0    1    0    1    0   
    # 5678              0    1    0    0    0    0    0    0    0    0    0    0   
    # 9012              0    0    0    0    0    0    0    0    0    0    1    1   
    
    #                 S61  V29  V87  
    # account_number                 
    # 1234              0    1    0  
    # 5678              0    0    0  
    # 9012              0    0    0
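As a sanity check on how the chain works, here is the same pipeline on a tiny hypothetical frame (two accounts, two code columns), with the intermediate steps broken out so you can see that stack removes the NaN cells before str.get_dummies runs:

```python
import pandas as pd

# Hypothetical miniature data, just to trace the pipeline step by step.
df = pd.DataFrame({"account_number": [1234, 5678],
                   "c1": ["V89", "M54"],
                   "c2": ["V43", None]})
codes = ["V89", "M54", "V43"]

# 1. where + isin: cells outside the code list (and None) become NaN
masked = df.set_index("account_number")[["c1", "c2"]].where(lambda x: x.isin(codes))
# 2. stack: flattens to a Series and drops the NaN cells
stacked = masked.stack()
# 3. str.get_dummies: one 0/1 column per code actually present
dummies = stacked.str.get_dummies()
# 4. groupby.max per account, then reindex to the full code list
res = (dummies.groupby(level="account_number").max()
              .reindex(columns=codes, fill_value=0))
print(res)
```

Each step returns a new object, so you can print any intermediate (masked, stacked, dummies) to inspect it while adapting the chain to your five-column, multi-million-row dataset.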