What is the best way to filter groups by conditionally checking the values of the first row of each group only?

This is my DataFrame:

import pandas as pd
df = pd.DataFrame(
    {
        'group': list('xxxxyyy'),
        'open': [100, 150, 200, 160, 300, 150, 170],
        'close': [105, 150, 200, 160, 350, 150, 170],
        'stop': [104, 104, 104, 104, 400, 400, 400]
    }
)

Expected output is returning group x based on the group column:

  group  open  close  stop
0     x   100    105   104
1     x   150    150   104
2     x   200    200   104
3     x   160    160   104

Logic:

I want to check if df.stop.iloc[0] for each group is between df.open.iloc[0] and df.close.iloc[0]. And if it is between these two, I want to return that entire group.

This is my attempt. It works but I think there is a better way to do it. Note that in the if clause, both conditions are needed to be checked.

def func(df):
    s = df.stop.iloc[0]
    o = df.open.iloc[0]
    c = df.close.iloc[0]

    if (o <= s <= c) or (c <= s <= o):
        return df

out = df.groupby('group').apply(func).reset_index(drop=True)

Solution

You could groupby.first, then build a mask with isin:

tmp = df.groupby('group').first()
keep = tmp.index[tmp['stop'].between(tmp['open'], tmp['close'])
                |tmp['stop'].between(tmp['close'], tmp['open'])
                ]
df[df['group'].isin(keep)]

Output:

  group  open  close  stop
0     x   100    105   104
1     x   150    150   104
2     x   200    200   104
3     x   160    160   104

Intermediate tmp:

       open  close  stop  o<=s<=c  c<=s<=o     OR
group                                            
x       100    105   104     True    False   True
y       300    350   400    False    False  False

with `groupby.filter`:

def func(df):
    s = df.stop.iloc[0]
    o = df.open.iloc[0]
    c = df.close.iloc[0]

    return (o <= s <= c) or (c <= s <= o)

df.groupby('group').filter(func)

separate groups:

g = df.groupby('group')
tmp = g.first()

keep = tmp.index[tmp['stop'].between(tmp['open'], tmp['close'])
                |tmp['stop'].between(tmp['close'], tmp['open'])
                ]

for x in keep:
    print(f'group {x}')
    print(g.get_group(x))

Output:

group x
  group  open  close  stop
0     x   100    105   104
1     x   150    150   104
2     x   200    200   104
3     x   160    160   104

without groupby

For this particular case, you can even skip the groupby and replace it with drop_duplicates:

tmp = df.drop_duplicates('group')
keep = tmp.loc[tmp['stop'].between(tmp['open'], tmp['close'])
              |tmp['stop'].between(tmp['close'], tmp['open']),
               'group']

out = df[df['group'].isin(keep)]

timings

Tested on ~ 7K rows with groups of 3 rows

# filter
266 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# groupby.first + isin
1.63 ms ± 24.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# drop_duplicates + isin
1.34 ms ± 34.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

What is the best way to filter groups by conditionally checking the values of the first row of each group only?

with groupby.filter:

separate groups:

without groupby

timings

with `groupby.filter`: