Search code examples
pythonpandasdataframepandas-loc

How to Test multiple columns of pandas for a condition at once and update them


I have a data frame like this:

test = pd.DataFrame({'col1':[10,20,30,40], 'col2':[5,10,15,20], 'col3':[6,12,18,24]})
test

The dataframe looks like:

   col1 col2 col3
0   10  5    6
1   20  10   12
2   30  15   18
3   40  20   24 

I wanna replace the values which are greater than 10 in col2 or col3 with zero. I wanna use loc function for this purpose.
My desired output is:

   col1 col2 col3
0   10  5    6
1   20  10   0
2   30  0    0
3   40  0    0

I have tried the following solution:

cols_to_update = ['col2', 'col3']
test.loc[test[cols_to_update]>10]=0
test

It shows the following error:

KeyError: "None of [Index([('c', 'o', 'l', '1'), ('c', 'o', 'l', '2')], dtype='object')] are in the [index]"

When I use a single column to test the condition, it doesn't show 'KeyError', but now it also replaces values in other two columns.

test.loc[test['col2']>10]=0
test

Output is:

   col1 col2 col3
0   10  5    6
1   0   0    0
2   0   0    0
3   0   0    0
  1. Can we use loc for this purpose?
  2. Why is loc behaving like this?
  3. What is the efficient solution?

Solution

  • I would use numpy.where to conditionally replace values of multiple columns:

    import numpy as np
    
    cols_to_update = ['col2', 'col3']
    test[cols_to_update] = np.where(test[cols_to_update] > 10, 0, test[cols_to_update])
    

    The expression test[cols_to_update] > 10 gives you a boolean mask:

        col2   col3
    0  False  False
    1  False   True
    2   True   True
    3   True   True
    

    Then, np.where picks the value 0 whenever this mask is True or it picks the corresponding original data test[cols_to_update] whenever the mask is False.


    Your solution test.loc[test[cols_to_update]>10]=0 doesn't work because loc in this case would require a boolean 1D series, while test[cols_to_update]>10 is still a DataFrame with two columns. This is also the reason why you cannot use loc for this problem (at least not without looping over the columns): The indices where the values of columns 2 and 3 meet the condition > 10 are different.

    When would loc be appropriate in this case? For example if you wanted to set both columns 2 and 3 to zero when any of the two is greater than 10:

    test.loc[(test[cols_to_update] > 10).any(axis=1), cols_to_update] = 0
    test
    # out:
       col1  col2  col3
    0    10     5     6
    1    20     0     0
    2    30     0     0
    3    40     0     0
    

    In this case you index with a 1D Series ((test[cols_to_update] > 10).any(axis=1)), which is an appropriate use case for loc.