I have a data frame like this:
test = pd.DataFrame({'col1':[10,20,30,40], 'col2':[5,10,15,20], 'col3':[6,12,18,24]})
test
The dataframe looks like:
col1 col2 col3
0 10 5 6
1 20 10 12
2 30 15 18
3 40 20 24
I wanna replace the values which are greater than 10 in col2 or col3 with zero. I wanna use loc function for this purpose.
My desired output is:
col1 col2 col3
0 10 5 6
1 20 10 0
2 30 0 0
3 40 0 0
I have tried the following solution:
cols_to_update = ['col2', 'col3']
test.loc[test[cols_to_update]>10]=0
test
It shows the following error:
KeyError: "None of [Index([('c', 'o', 'l', '1'), ('c', 'o', 'l', '2')], dtype='object')] are in the [index]"
When I use a single column to test the condition, it doesn't show 'KeyError', but now it also replaces values in other two columns.
test.loc[test['col2']>10]=0
test
Output is:
col1 col2 col3
0 10 5 6
1 0 0 0
2 0 0 0
3 0 0 0
I would use numpy.where
to conditionally replace values of multiple columns:
import numpy as np
cols_to_update = ['col2', 'col3']
test[cols_to_update] = np.where(test[cols_to_update] > 10, 0, test[cols_to_update])
The expression test[cols_to_update] > 10
gives you a boolean mask:
col2 col3
0 False False
1 False True
2 True True
3 True True
Then, np.where
picks the value 0
whenever this mask is True
or it picks the corresponding original data test[cols_to_update]
whenever the mask is False
.
Your solution test.loc[test[cols_to_update]>10]=0
doesn't work because loc in this case would require a boolean 1D series, while test[cols_to_update]>10
is still a DataFrame with two columns. This is also the reason why you cannot use loc for this problem (at least not without looping over the columns): The indices where the values of columns 2 and 3 meet the condition > 10
are different.
When would loc
be appropriate in this case? For example if you wanted to set both columns 2 and 3 to zero when any of the two is greater than 10:
test.loc[(test[cols_to_update] > 10).any(axis=1), cols_to_update] = 0
test
# out:
col1 col2 col3
0 10 5 6
1 20 0 0
2 30 0 0
3 40 0 0
In this case you index with a 1D Series ((test[cols_to_update] > 10).any(axis=1)
), which is an appropriate use case for loc
.