I have a dataframe with ~300 columns, having columns from 'Prod1' to 'Prod300' , just like below :
ID Product Prod1 Prod2 Prod3 Prod4 Prod5 Prod6 Prod7
01 1 0 0 0 0 0 0 0
01 2 0 0 0 0 0 0 0
01 3 0 0 0 0 0 0 0
02 1 0 0 0 0 0 0 0
02 4 0 0 0 0 0 0 0
03 3 0 0 0 0 0 0 0
04 6 0 0 0 0 0 0 0
What I want is to simply populate the corresponding 'Prod' column according to the value in 'Product' column, i.e. the below df as Output -
ID Product Prod1 Prod2 Prod3 Prod4 Prod5 Prod6 Prod7
01 1 1 0 0 0 0 0 0
01 2 0 1 0 0 0 0 0
01 3 0 0 1 0 0 0 0
02 1 1 0 0 0 0 0 0
02 4 0 0 0 1 0 0 0
03 3 0 0 1 0 0 0 0
04 6 0 0 0 0 0 1 0
The thing is, I know I have only 300 products in total, but I may not get all the values inside the 'Product' column and I want to keep the 'Prod7' column too even if it isnt present in the 'Product' column and have all values = 0. So, I cant use pivot_table...
I have tried looping through the 'Product' column, like below :
for row in range ( len( df['Product'])) :
df['Prod' + str( df['Product'][row] ) ][row] = 1
what I am doing above is going through the values in the 'Product' column one by one, and then populating 1 in the corresponding column and same row.
Also, since there are 300 columns from Prod1 to Prod300 I would like not to do it column by column like below :
df['Prod1'] = np.where(df['Product'] == 1, 1 , 0)
df['Prod2'] = np.where(df['Product'] == 2, 1 , 0)
df['Prod3'] = np.where(df['Product'] == 3, 1 , 0)
........
........
Now the above codes works, but I know you shouldn't use for loops since they are slow.
Maybe its because I am new to python, and its really simple but any alternative will be much appreciated. I am also concerned about the Run-time of the code so anything faster than before will be very much helpful...
Empty columns are not necessary here.
Use get_dummies
for new indicator columns, add DataFrame.reindex
for add non exist columns by range
and last DataFrame.add_prefix
:
df1 = (pd.get_dummies(df['Product'])
.reindex(range(1, 8), axis=1, fill_value=0)
.add_prefix('Prod'))
print (df1)
Prod1 Prod2 Prod3 Prod4 Prod5 Prod6 Prod7
0 1 0 0 0 0 0 0
1 0 1 0 0 0 0 0
2 0 0 1 0 0 0 0
3 1 0 0 0 0 0 0
4 0 0 0 1 0 0 0
5 0 0 1 0 0 0 0
6 0 0 0 0 0 1 0
And then is possible use DataFrame.join
with first 2 columns of original DataFrame
:
df2 = df.iloc[:, :2].join(df1)
print (df2)
ID Product Prod1 Prod2 Prod3 Prod4 Prod5 Prod6 Prod7
0 1 1 1 0 0 0 0 0 0
1 1 2 0 1 0 0 0 0 0
2 1 3 0 0 1 0 0 0 0
3 2 1 1 0 0 0 0 0 0
4 2 4 0 0 0 1 0 0 0
5 3 3 0 0 1 0 0 0 0
6 4 6 0 0 0 0 0 1 0