My DataFrame:
import pandas as pd

df = pd.DataFrame(
    {
        'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e', 'e'],
        'num': [1, 2, 3, 1, 12, 12, 13, 2, 4, 2, 5, 6, 10, 20, 30]
    }
)
The expected output is the following three groups taken from df:
group num
0 a 1
1 a 2
2 a 3
group num
6 c 13
7 c 2
8 c 4
group num
12 e 10
13 e 20
14 e 30
Logic:
I want to compare the first value of each group to the last cummax of the num column. This code explains it better:
df['last_num'] = df.groupby('group')['num'].tail(1)
df['last_num'] = df.last_num.ffill().cummax()
But I think what I really need is this desired_cummax:
group num last_num desired_cummax
0 a 1 NaN 3
1 a 2 NaN 3
2 a 3 3.0 3
3 b 1 3.0 3
4 b 12 3.0 3
5 b 12 12.0 3
6 c 13 12.0 3
7 c 2 12.0 3
8 c 4 12.0 4
9 d 2 12.0 4
10 d 5 12.0 4
11 d 6 12.0 4
12 e 10 12.0 4
13 e 20 12.0 4
14 e 30 30.0 30
I don't want a new cummax if the first value of num for a group is less than last_num.
For example, for group b the first value of num is 1. Since that is less than its last_num, the end of group b should not set 12. The value should still be 3.
Now for group c, since its first value is greater than last_num, a new cummax is set when the end of group c is reached.
After that I want to filter the groups, keeping a group if, within that group, df.num.iloc[0] > df.desired_cummax.iloc[0].
Note that the first group should be in the expected output no matter what.
Maybe there is a better approach, but this is what I thought might work.
My attempt was to create last_num, but I don't know how to continue from there.
IIUC, you can aggregate first/last per group, mask the unwanted values, and map the result back to the groups. Finally, shift one row up:
tmp = df.groupby('group')['num'].agg(['first', 'last'])
s = tmp['last'].where(tmp['last'].shift(fill_value=0).le(tmp['first'])).ffill().cummax()
df['desired_cummax'] = df['group'].map(s.shift().bfill()).shift(-1).fillna(df['num'])
Output:
group num desired_cummax
0 a 1 3.0
1 a 2 3.0
2 a 3 3.0
3 b 1 3.0
4 b 12 3.0
5 b 12 3.0
6 c 13 3.0
7 c 2 3.0
8 c 4 4.0
9 d 2 4.0
10 d 5 4.0
11 d 6 4.0
12 e 10 4.0
13 e 20 4.0
14 e 30 30.0
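For readers who find the one-liner for s dense, the same masking rule can be spelled out as a plain loop. This is only a sketch of my reading of the code above: the names s_loop, prev_last and current are made up for illustration, and sort=False is an added assumption so that groups are visited in order of appearance.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e', 'e'],
        'num': [1, 2, 3, 1, 12, 12, 13, 2, 4, 2, 5, 6, 10, 20, 30]
    }
)

# first and last value of num per group, in order of appearance
tmp = df.groupby('group', sort=False)['num'].agg(['first', 'last'])

s_loop = {}      # hypothetical name: loop-built equivalent of the Series "s"
prev_last = 0    # plays the role of shift(fill_value=0)
current = None   # last accepted value, carried forward (the ffill step)

for label, first, last in tmp.itertuples():
    if prev_last <= first:
        # group accepted: use its last value, never going below the running max (cummax)
        current = last if current is None else max(current, last)
    # rejected groups keep the carried value
    s_loop[label] = current
    prev_last = last

# values per group: a -> 3, b -> 3, c -> 4, d -> 4, e -> 30
print(pd.Series(s_loop, name='s'))
```

Groups b and d are rejected because the previous group's last value (3 and 4) is greater than their first value (1 and 2), so they inherit the value carried from a and c.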
Intermediates:
# computation of the mapping Series "s"
first last last.shift(fill_value=0) .le(tmp['first']) where .ffill()
group
a 1 3 0 True 3.0 3.0
b 1 12 3 False NaN 3.0
c 13 4 12 True 4.0 4.0
d 2 6 4 False NaN 4.0
e 10 30 6 True 30.0 30.0
# shifting before mapping
s s.shift() .bfill()
group
a 3.0 NaN 3.0
b 3.0 3.0 3.0
c 4.0 3.0 3.0
d 4.0 4.0 4.0
e 30.0 4.0 4.0
# mapping
group map .shift(-1) .fillna(df['num'])
0 a 3.0 3.0 3.0
1 a 3.0 3.0 3.0
2 a 3.0 3.0 3.0
3 b 3.0 3.0 3.0
4 b 3.0 3.0 3.0
5 b 3.0 3.0 3.0
6 c 3.0 3.0 3.0
7 c 3.0 3.0 3.0
8 c 3.0 4.0 4.0
9 d 4.0 4.0 4.0
10 d 4.0 4.0 4.0
11 d 4.0 4.0 4.0
12 e 4.0 4.0 4.0
13 e 4.0 4.0 4.0
14 e 4.0 NaN 30.0
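The question also asks to filter the groups once desired_cummax exists. One possible way, sketched here under the assumption that "first group" means the first group in order of appearance (the names firsts, keep, kept_labels and groups are just illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {
        'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e', 'e'],
        'num': [1, 2, 3, 1, 12, 12, 13, 2, 4, 2, 5, 6, 10, 20, 30]
    }
)

# desired_cummax, computed exactly as in the answer above
tmp = df.groupby('group')['num'].agg(['first', 'last'])
s = tmp['last'].where(tmp['last'].shift(fill_value=0).le(tmp['first'])).ffill().cummax()
df['desired_cummax'] = df['group'].map(s.shift().bfill()).shift(-1).fillna(df['num'])

# first row of each group, indexed by group label (order of appearance)
firsts = df.groupby('group', sort=False)[['num', 'desired_cummax']].first()

# keep a group when its first num exceeds its first desired_cummax
keep = firsts['num'] > firsts['desired_cummax']
# the first group is kept no matter what
keep.iloc[0] = True

kept_labels = keep.index[keep]

# one DataFrame per kept group, original row index preserved
groups = [g for _, g in df[df['group'].isin(kept_labels)].groupby('group', sort=False)]

for g in groups:
    print(g[['group', 'num']])
```

With the sample data this keeps groups a, c and e: c and e pass the comparison (13 > 3 and 10 > 4), and a is kept only because it is the first group.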