Search code examples
pythonpandasboolean-expressionhierarchical

How to make assignment to hierarchical column of pandas dataframe using boolean mask?


I have a dataframe like this:

import pandas as pd
df = pd.DataFrame({
    "time": [1, 2, 1, 2], 
    "site": ['a', 'a', 'b', 'b'], 
    "val1": [11, 12, 21, 22],
    "val2": [101, 102, 201, 202]
})
df.set_index(['time', 'site'], inplace=True, append=False)
df = df.unstack("site")
print df

     val1     val2     
site    a   b    a    b
time                   
1      11  21  101  201
2      12  22  102  202

I would like change some values that match a boolean filter. e.g.:

ix = df.val1 > 20
print ix

site      a     b
time             
1     False  True
2     False  True

A natural thing to try would be df.val1[ix] = 50. This does the expected assignment, but gives a warning: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead.

So now I am trying to achieve something similar using df.loc. But I can't find any way to use df.loc with this kind of boolean mask. This seems to be because I am using hierarchical columns, i.e., I don't have much trouble if I only have one set of values (val1). Unfortunately, assignments with boolean filters on hierarchical columns aren't covered very well in the docs.

I've tried referring to df.loc[:,'val1',ix], but that gives IndexingError: Too many indexers. I've tried df.loc[:,'val1'][ix] = 50, and that works but gives the SettingWithCopyWarning.

I can use df.val1 = df.val1.where(~ix, other=50) but this seems unintuitive, inefficient and inflexible (e.g., it couldn't easily be extended to add 10 to the existing values).

Is there some other indexing method I should use to assign values to a hierarchical column of a dataframe, based on a boolean mask?

Edited to extend the question:

I didn't realize this would be an issue, but I'd actually like to filter based on values in both the val1 and val2 columns and change values in both sets of columns, something like this:

ix = (df.val1 > 20) | (df.val2 < 102)
df.val1[ix] = 50
df.val2[ix] = 150

Is there a simple indexing approach that can do this? It's pretty easy with numpy ndarrays, but seems to be much trickier with a pandas dataframe.


Solution

  • you can just use a list to select you column

    idx = df[['val1']] > 20
    
    idx
    Out[39]: 
           val1      
    site      a     b
    time             
    1     False  True
    2     False  True
    
    df[idx] = 50
    
    df
    Out[41]: 
         val1     val2     
    site    a   b    a    b
    time                   
    1      11  50  101  201
    2      12  50  102  202