How can I filter groups by comparing the first value of each group and the last cummax that changes conditionally?


My DataFrame:

import pandas as pd
df = pd.DataFrame(
    {
        'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e', 'e'],
        'num': [1, 2, 3, 1, 12, 12, 13, 2, 4, 2, 5, 6, 10, 20, 30]
    }
)

The expected output is these three groups from the df above:

  group  num
0      a    1
1      a    2
2      a    3

   group  num
6      c   13
7      c    2
8      c    4

   group  num
12     e   10
13     e   20
14     e   30

Logic:

I want to compare the first value of each group to the last cummax of the num column. This code explains it better:

df['last_num'] = df.groupby('group')['num'].tail(1)
df['last_num'] = df.last_num.ffill().cummax()

But I think what I really need is this desired_cummax:

   group  num  last_num  desired_cummax
0      a    1       NaN               3
1      a    2       NaN               3
2      a    3       3.0               3
3      b    1       3.0               3
4      b   12       3.0               3
5      b   12      12.0               3
6      c   13      12.0               3
7      c    2      12.0               3
8      c    4      12.0               4
9      d    2      12.0               4
10     d    5      12.0               4
11     d    6      12.0               4
12     e   10      12.0               4
13     e   20      12.0               4
14     e   30      30.0              30

I don't want a new cummax if the first value of num in a group is less than last_num.

For example, in group b the first value of num is 1. Since that is less than its last_num, the value at the end of group b should not become 12. It should still be 3.

For group c, the first value is greater than last_num, so a new cummax is set when the end of group c is reached.

After that, I want to filter the groups, keeping a group if its df.num.iloc[0] > df.desired_cummax.iloc[0].

Note that the first group should be in the expected output no matter what.
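
To pin the rule down, it can be written as a plain loop over the groups. This is only a sketch of one reading of the logic above. It assumes that a group sets a new cummax when its first num is strictly greater than the running value, that the new value is the larger of the running value and the group's last num, and that the first group always counts. The names running, kept and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e', 'e'],
        'num': [1, 2, 3, 1, 12, 12, 13, 2, 4, 2, 5, 6, 10, 20, 30]
    }
)

running = None  # running "desired cummax"; None until the first group is seen
kept = []       # names of the groups that pass the rule

# sort=False keeps the groups in their order of appearance
for name, grp in df.groupby('group', sort=False):
    first, last = grp['num'].iloc[0], grp['num'].iloc[-1]
    if running is None:
        # the first group is always kept and seeds the running value
        kept.append(name)
        running = last
    elif first > running:
        # assumption: strict ">" and the running value never decreases
        kept.append(name)
        running = max(running, last)

result = df[df['group'].isin(kept)]
print(kept)
```

On the sample data this keeps groups a, c and e, which matches the expected output above.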

Maybe there is a better approach, but this is what I thought might work.

My attempt was to create last_num, but I don't know how to continue.


Solution

  • IIUC, you can aggregate the first and last values per group, mask the unwanted last values, and map the result back onto the rows. Finally, shift the result up by one row:

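    # first and last 'num' of each group
    tmp = df.groupby('group')['num'].agg(['first', 'last'])

    # keep a group's last value only if the previous group's last value is <= its first value,
    # then fill the gaps forward and take the running maximum
    s = tmp['last'].where(tmp['last'].shift(fill_value=0).le(tmp['first'])).ffill().cummax()

    # map the previous group's value onto the rows, then shift up one row
    df['desired_cummax'] = df['group'].map(s.shift().bfill()).shift(-1).fillna(df['num'])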
    

    Output:

       group  num  desired_cummax
    0      a    1             3.0
    1      a    2             3.0
    2      a    3             3.0
    3      b    1             3.0
    4      b   12             3.0
    5      b   12             3.0
    6      c   13             3.0
    7      c    2             3.0
    8      c    4             4.0
    9      d    2             4.0
    10     d    5             4.0
    11     d    6             4.0
    12     e   10             4.0
    13     e   20             4.0
    14     e   30            30.0
    

    Intermediates:

    # computation of the mapping Series "s"
           first  last  last.shift(fill_value=0)  .le(tmp['first'])  where  .ffill()
    group                                                                           
    a          1     3                         0               True    3.0       3.0
    b          1    12                         3              False    NaN       3.0
    c         13     4                        12               True    4.0       4.0
    d          2     6                         4              False    NaN       4.0
    e         10    30                         6               True   30.0      30.0
    
    # shifting before mapping
              s  s.shift()  .bfill()
    group                           
    a       3.0        NaN       3.0
    b       3.0        3.0       3.0
    c       4.0        3.0       3.0
    d       4.0        4.0       4.0
    e      30.0        4.0       4.0
    
    # mapping
       group  map  .shift(-1)  .fillna(df['num'])
    0      a  3.0         3.0                 3.0
    1      a  3.0         3.0                 3.0
    2      a  3.0         3.0                 3.0
    3      b  3.0         3.0                 3.0
    4      b  3.0         3.0                 3.0
    5      b  3.0         3.0                 3.0
    6      c  3.0         3.0                 3.0
    7      c  3.0         3.0                 3.0
    8      c  3.0         4.0                 4.0
    9      d  4.0         4.0                 4.0
    10     d  4.0         4.0                 4.0
    11     d  4.0         4.0                 4.0
    12     e  4.0         4.0                 4.0
    13     e  4.0         4.0                 4.0
    14     e  4.0         NaN                30.0
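
The snippet above stops at desired_cummax. One possible way to finish with the filter described in the question is sketched below. It assumes that a group is kept when its first num is strictly greater than its first desired_cummax, and that the first group is always kept. The names first_num, first_cmax, is_first_group and out are illustrative, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e', 'e'],
        'num': [1, 2, 3, 1, 12, 12, 13, 2, 4, 2, 5, 6, 10, 20, 30]
    }
)

# desired_cummax, computed exactly as in the answer
tmp = df.groupby('group')['num'].agg(['first', 'last'])
s = tmp['last'].where(tmp['last'].shift(fill_value=0).le(tmp['first'])).ffill().cummax()
df['desired_cummax'] = df['group'].map(s.shift().bfill()).shift(-1).fillna(df['num'])

# broadcast each group's first 'num' and first 'desired_cummax' to all of its rows
g = df.groupby('group')
first_num = g['num'].transform('first')
first_cmax = g['desired_cummax'].transform('first')

# the first group is kept no matter what
is_first_group = df['group'].eq(df['group'].iloc[0])

out = df[first_num.gt(first_cmax) | is_first_group]
print(out['group'].unique().tolist())
```

On the sample data this leaves the rows of groups a, c and e. To get them as separate frames, iterate over out.groupby('group').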