
Finding the last row that meets conditions of a mask


This is my dataframe:

import pandas as pd
df = pd.DataFrame({'a': [20, 21, 333, 444], 'b': [20, 20, 20, 20]})

I want to create column c by using this mask:

mask = (df.a >= df.b)

I want to flag only the last row that meets this condition in the new column c. The output that I want looks like this:

     a   b    c
0   20  20  NaN
1   21  20  NaN
2  333  20  NaN
3  444  20  x

I tried the code below but it didn't work:

df.loc[mask.cumsum().gt(1) & mask, 'c'] = 'x'

Solution

  • To flag the last value that satisfies a condition, use duplicated() with keep='last'. mask holds at most two distinct values (True/False). mask.duplicated(keep='last') flags every duplicate as True except the last occurrence of each value, so its negation, ~mask.duplicated(keep='last'), is True only at the last True and the last False. Combining that with mask itself, as mask & ~mask.duplicated(keep='last'), drops the last False and leaves only the last True, which is the mask we want.

    import pandas as pd
    df = pd.DataFrame({'a': [20, 21, 333, 444], 'b': [20, 20, 20, 20]})
    mask = (df.a >= df.b)
    
    df['c'] = pd.Series('x', df.index).where(mask & ~mask.duplicated(keep='last'))
    

    If you want to slice/assign, then you can use this chained mask as well.

    df.loc[mask & ~mask.duplicated(keep='last'), 'c'] = 'x'
    

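    A shorter version of @mandy8055's answer is to call idxmax() on the cumulative sum of mask. idxmax() returns the first index at which the cumulative sum reaches its maximum, and that is the row of the last True value (although this is showing a FutureWarning on pandas 2.1.0). As pointed out by @mozway, this works only as long as there's at least one True value in mask; if mask is all False, the cumulative sum is zero everywhere and idxmax() returns the first row, which would be flagged incorrectly.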

    df.loc[mask.cumsum().idxmax(), 'c'] = 'x'
    

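To make the intermediate masks easier to follow, here is a minimal sketch on a slightly different frame, where the condition fails again after the last match. The data, the names last_of_each and last_true, and the helper flag_last_true are assumptions made up for illustration; they are not part of the question or of any answer. The helper wraps the idxmax() variant with a guard for an all-False mask and assumes the index labels are unique.

```python
import pandas as pd

# Hypothetical data: the last matching row (index 3) is not the last row.
df = pd.DataFrame({'a': [20, 5, 333, 444, 1], 'b': [20, 20, 20, 20, 20]})
mask = df['a'] >= df['b']            # True, False, True, True, False

# True at the last occurrence of each distinct value in mask,
# i.e. at the last True (index 3) and at the last False (index 4).
last_of_each = ~mask.duplicated(keep='last')

# Requiring mask as well keeps only the last True.
last_true = mask & last_of_each

# where() keeps 'x' where last_true holds and fills NaN elsewhere,
# so column c is created even if no row matches.
df['c'] = pd.Series('x', index=df.index).where(last_true)


def flag_last_true(mask, label='x'):
    """Hypothetical helper: idxmax() variant, guarded for an all-False mask.

    Assumes the index labels of mask are unique.
    """
    out = pd.Series(float('nan'), index=mask.index, dtype=object)
    if mask.any():
        # cumsum() first reaches its maximum at the last True value.
        out.loc[mask.cumsum().idxmax()] = label
    return out


alt = flag_last_true(mask)
print(df)
print(alt)
```

Both approaches should flag only index 3 here. The difference shows up when nothing matches: the duplicated() version yields an all-NaN column without extra checks, while the idxmax() version needs the mask.any() guard.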