Search code examples
pythonpandasloops

Populating other columns in the dataframe based on column-names present in one column


I have a dataframe with ~300 columns, having columns from 'Prod1' to 'Prod300' , just like below :

ID   Product Prod1 Prod2 Prod3 Prod4 Prod5 Prod6 Prod7
01     1      0     0     0     0     0     0     0
01     2      0     0     0     0     0     0     0
01     3      0     0     0     0     0     0     0
02     1      0     0     0     0     0     0     0
02     4      0     0     0     0     0     0     0
03     3      0     0     0     0     0     0     0
04     6      0     0     0     0     0     0     0

What I want is to simply populate the corresponding 'Prod' column according to the value in 'Product' column, i.e. the below df as Output -

ID   Product Prod1 Prod2 Prod3 Prod4 Prod5 Prod6 Prod7
01     1      1     0     0     0     0     0     0
01     2      0     1     0     0     0     0     0
01     3      0     0     1     0     0     0     0
02     1      1     0     0     0     0     0     0
02     4      0     0     0     1     0     0     0
03     3      0     0     1     0     0     0     0
04     6      0     0     0     0     0     1     0

The thing is, I know I have only 300 products in total, but I may not get all the values inside the 'Product' column and I want to keep the 'Prod7' column too even if it isnt present in the 'Product' column and have all values = 0. So, I cant use pivot_table...

I have tried looping through the 'Product' column, like below :

for row in range ( len( df['Product'])) :
   df['Prod' + str( df['Product'][row] ) ][row] = 1

what I am doing above is going through the values in the 'Product' column one by one, and then populating 1 in the corresponding column and same row.

Also, since there are 300 columns from Prod1 to Prod300 I would like not to do it column by column like below :

df['Prod1'] = np.where(df['Product'] == 1, 1 , 0)
df['Prod2'] = np.where(df['Product'] == 2, 1 , 0)
df['Prod3'] = np.where(df['Product'] == 3, 1 , 0)
........
........

Now the above codes works, but I know you shouldn't use for loops since they are slow.

Maybe its because I am new to python, and its really simple but any alternative will be much appreciated. I am also concerned about the Run-time of the code so anything faster than before will be very much helpful...


Solution

  • Empty columns are not necessary here.

    Use get_dummies for new indicator columns, add DataFrame.reindex for add non exist columns by range and last DataFrame.add_prefix:

    df1 = (pd.get_dummies(df['Product'])
             .reindex(range(1, 8), axis=1, fill_value=0)
             .add_prefix('Prod'))
    print (df1)
       Prod1  Prod2  Prod3  Prod4  Prod5  Prod6  Prod7
    0      1      0      0      0      0      0      0
    1      0      1      0      0      0      0      0
    2      0      0      1      0      0      0      0
    3      1      0      0      0      0      0      0
    4      0      0      0      1      0      0      0
    5      0      0      1      0      0      0      0
    6      0      0      0      0      0      1      0
    

    And then is possible use DataFrame.join with first 2 columns of original DataFrame:

    df2 = df.iloc[:, :2].join(df1)
    print (df2)
       ID  Product  Prod1  Prod2  Prod3  Prod4  Prod5  Prod6  Prod7
    0   1        1      1      0      0      0      0      0      0
    1   1        2      0      1      0      0      0      0      0
    2   1        3      0      0      1      0      0      0      0
    3   2        1      1      0      0      0      0      0      0
    4   2        4      0      0      0      1      0      0      0
    5   3        3      0      0      1      0      0      0      0
    6   4        6      0      0      0      0      0      1      0