python  pandas  dummy-variable  dummy-data

Need help creating a pseudo-dummy variable that instead of '1' uses the value from another column


I have a dataframe that looks like this:

A     B    C

34    x    a
3     y    b
23    y    a
40    x    b

Essentially, columns B and C need to become dummy variables with headers B_x, B_y, C_a, and C_b. This is almost exactly what get_dummies() does in pandas, with one major difference: wherever a dummy variable would be 1, I need it to hold the value from column A instead. Something like:

A     B_x   B_y  C_a C_b

34    34    0    34  0
3     0     3    0   3
23    0     23   23  0
40    40    0    0   40

I'm working with fairly large data with a high number of categories.

I've tried calling get_dummies() on the dataset and then using df.mask to change all the 1s to df.A, but this is atrociously slow (about 10 minutes).


Solution

  • Use pd.get_dummies and broadcast column A

    df2 = pd.get_dummies(df[['B', 'C']]) * df.A.values.reshape([-1,1])
    
        B_x B_y C_a C_b
    0   34  0   34  0
    1   0   3   0   3
    2   0   23  23  0
    3   40  0   0   40
    

    To add A back, there are many alternatives: assign it with df2['A'] = df['A'], or use pd.concat:

    pd.concat([df.A, df2], axis=1)
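
For anyone who wants to run the whole thing end to end, here is a minimal sketch built on the sample frame from the question. Two details are my additions rather than part of the answer: passing `dtype=int` to `get_dummies` (an assumption made because recent pandas versions return boolean dummies by default), and the `mul(..., axis=0)` line, which is just another way to express the same row-wise multiplication. The names `dummies`, `df2_alt`, and `result` are only for illustration.

```python
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame({
    "A": [34, 3, 23, 40],
    "B": ["x", "y", "y", "x"],
    "C": ["a", "b", "a", "b"],
})

# dtype=int is an assumption added here: it forces 0/1 integer indicators
# instead of the boolean dummies that newer pandas versions produce.
dummies = pd.get_dummies(df[["B", "C"]], dtype=int)

# Broadcast A down the rows, as in the answer: a column vector of shape
# (n, 1) multiplies every dummy column element-wise.
df2 = dummies * df["A"].to_numpy().reshape(-1, 1)

# Alternative spelling of the same operation, aligned on the row index.
df2_alt = dummies.mul(df["A"], axis=0)

# Put A back in front of the scaled dummy columns.
result = pd.concat([df["A"], df2], axis=1)
print(result)
```

On the four-row sample this should reproduce the desired table from the question. Both spellings avoid the cell-by-cell replacement that df.mask performs, which is likely why the multiplication is so much faster, though I have not timed it on large data.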