Search code examples
pythonpandasdataframegroup-by

How can I find the first row that meets conditions of a mask for each group?


This is my DataFrame:

import pandas as pd
df = pd.DataFrame(
    {
        'a': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y', 'y', 'y'],
        'b': [1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 2, 2],
        'c': [9, 8, 11, 13, 14, 3, 104, 106, 11, 100, 70, 7]
    }
)

Expected output: Creating column out:

    a  b    c    out
0   x  1    9    NaN
1   x  1    8    NaN
2   x  1   11    NaN
3   x  2   13  found
4   x  2   14    NaN
5   y  1    3    NaN
6   y  1  104  found
7   y  1  106    NaN
8   y  2   11    NaN
9   y  2  100    NaN
10  y  2   70    NaN
11  y  2    7    NaN

The mask is:

mask = (df.c > 10)

The process: Grouping is by column a:

a) For each group, finding the first row that meets the conditions of the mask.

b) For group x this condition only applies when b == 2. That is why row 3 is selected.

And this is my attempt. It is getting close but it feels like this is not the way:

def func(g):
    mask = (g.c > 10)
    g.loc[mask.cumsum().eq(1) & mask, 'out'] = 'found'
    return g

df = df.groupby('a').apply(func)

Solution

  • One option with groupby.idxmax:

    mask = (df['c'] > 10) & (df['a'].ne('x') | df['b'].eq(2))
    
    idx = mask.groupby(df['a']).idxmax()
    df.loc[idx[mask.loc[idx].values], 'out'] = 'found'
    

    Another with groupby.transform:

    mask = (df['c'] > 10) & (df['a'].ne('x') | df['b'].eq(2))
    
    df.loc[mask & mask.groupby(df['a'])
                      .transform(lambda m: (~m).shift(fill_value=True)
                                               .cummin()),
           'out'] = 'found'
    

    Output, with an extra group z that has no match:

        a  b    c    out
    0   x  1    9    NaN
    1   x  1    8    NaN
    2   x  1   11    NaN
    3   x  2   13  found
    4   x  2   14    NaN
    5   y  1    3    NaN
    6   y  1  104  found
    7   y  1  106    NaN
    8   y  2   11    NaN
    9   y  2  100    NaN
    10  y  2   70    NaN
    11  y  2    7    NaN
    12  z  3    1    NaN
    13  z  3    1    NaN
    

    last match

    to get the last match (instead of the first one), just inver the mask:

    Example:

    mask = (df['c'] > 10) & (df['a'].ne('x') | df['b'].eq(2))
    
    mask = mask[::-1]
    
    idx = mask.groupby(df['a']).idxmax()
    df.loc[idx[mask.loc[idx].values], 'out'] = 'found'
    
        a  b    c    out
    0   x  1    9    NaN
    1   x  1    8    NaN
    2   x  1   11    NaN
    3   x  2   13    NaN
    4   x  2   14  found
    5   y  1    3    NaN
    6   y  1  104    NaN
    7   y  1  106    NaN
    8   y  2   11    NaN
    9   y  2  100    NaN
    10  y  2   70  found
    11  y  2    7    NaN
    12  z  3    1    NaN
    13  z  3    1    NaN