Search code examples
pythonpandasdataframegroup-by

What is the best way to filter groups by conditionally checking the values of the first row of each group only?


This is my DataFrame:

import pandas as pd
df = pd.DataFrame(
    {
        'group': list('xxxxyyy'),
        'open': [100, 150, 200, 160, 300, 150, 170],
        'close': [105, 150, 200, 160, 350, 150, 170],
        'stop': [104, 104, 104, 104, 400, 400, 400]
    }
)

Expected output is returning group x based on the group column:

  group  open  close  stop
0     x   100    105   104
1     x   150    150   104
2     x   200    200   104
3     x   160    160   104

Logic:

I want to check if df.stop.iloc[0] for each group is between df.open.iloc[0] and df.close.iloc[0]. And if it is between these two, I want to return that entire group.

This is my attempt. It works but I think there is a better way to do it. Note that in the if clause, both conditions are needed to be checked.

def func(df):
    s = df.stop.iloc[0]
    o = df.open.iloc[0]
    c = df.close.iloc[0]

    if (o <= s <= c) or (c <= s <= o):
        return df

out = df.groupby('group').apply(func).reset_index(drop=True)

Solution

  • You could groupby.first, then build a mask with isin:

    tmp = df.groupby('group').first()
    keep = tmp.index[tmp['stop'].between(tmp['open'], tmp['close'])
                    |tmp['stop'].between(tmp['close'], tmp['open'])
                    ]
    df[df['group'].isin(keep)]
    

    Output:

      group  open  close  stop
    0     x   100    105   104
    1     x   150    150   104
    2     x   200    200   104
    3     x   160    160   104
    

    Intermediate tmp:

           open  close  stop  o<=s<=c  c<=s<=o     OR
    group                                            
    x       100    105   104     True    False   True
    y       300    350   400    False    False  False
    

    with groupby.filter:

    def func(df):
        s = df.stop.iloc[0]
        o = df.open.iloc[0]
        c = df.close.iloc[0]
    
        return (o <= s <= c) or (c <= s <= o)
    
    df.groupby('group').filter(func)
    

    separate groups:

    g = df.groupby('group')
    tmp = g.first()
    
    keep = tmp.index[tmp['stop'].between(tmp['open'], tmp['close'])
                    |tmp['stop'].between(tmp['close'], tmp['open'])
                    ]
    
    for x in keep:
        print(f'group {x}')
        print(g.get_group(x))
    

    Output:

    group x
      group  open  close  stop
    0     x   100    105   104
    1     x   150    150   104
    2     x   200    200   104
    3     x   160    160   104
    

    without groupby

    For this particular case, you can even skip the groupby and replace it with drop_duplicates:

    tmp = df.drop_duplicates('group')
    keep = tmp.loc[tmp['stop'].between(tmp['open'], tmp['close'])
                  |tmp['stop'].between(tmp['close'], tmp['open']),
                   'group']
    
    out = df[df['group'].isin(keep)]
    

    timings

    Tested on ~ 7K rows with groups of 3 rows

    # filter
    266 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # groupby.first + isin
    1.63 ms ± 24.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    
    # drop_duplicates + isin
    1.34 ms ± 34.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)