If I have a pandas dataframe with the following columns: id, num, amount.
I want to group the dataframe such that all rows in each group have the same id and amount and where each row's value of num has a value that is not more than 10 larger or smaller the next row's value of num.
For the same id, if one row to the next does not have the same amount or if the absolute difference between the two num values is more than 10 then it will start a new grouping. Having a row with a different id in the middle does not break a grouping.
How can I go about doing this?
I have not managed to make a grouping where I'm not looking for matching values (like here where I need it to be close - but not matching). I'm assuming that this would need some custom grouping function but I've been having trouble putting one together
Example dataframe:
id | amount | num |
---|---|---|
aaa-aaa | 130 | 12 |
aaa-aaa | 130 | 39 |
bbb-bbb | 270 | 41 |
ccc-ccc | 130 | 19 |
bbb-bbb | 270 | 37 |
aaa-aaa | 130 | 42 |
aaa-aaa | 380 | 39 |
Expected Groups:
Group 1:
id | amount | num |
---|---|---|
aaa-aaa | 130 | 12 |
Group 2:
id | amount | num |
---|---|---|
aaa-aaa | 130 | 39 |
aaa-aaa | 130 | 42 |
Group 3:
id | amount | num |
---|---|---|
bbb-bbb | 270 | 41 |
bbb-bbb | 270 | 37 |
Group 4:
id | amount | num |
---|---|---|
ccc-ccc | 130 | 19 |
Group 5:
id | amount | num |
---|---|---|
aaa-aaa | 380 | 39 |
The logic is not fully clear, but assuming you want to start a new group when there is a gap of more than 10:
close = (df.sort_values(by=['amount', 'num'])
.groupby('amount')
['num'].diff().abs().gt(10).cumsum()
)
for _, g in df.groupby(['amount', close]):
print(g, end='\n\n')
Output:
id amount num
0 aaa-aaa 130 12
3 ddd-ddd 130 19
id amount num
1 bbb-bbb 130 39
id amount num
2 ccc-ccc 270 41
4 eee-eee 270 37
# sort values by amount/sum
df.sort_values(by=['amount', 'num'])
id amount num
0 aaa-aaa 130 12
3 ccc-ccc 130 19
1 aaa-aaa 130 39
5 aaa-aaa 130 42
4 bbb-bbb 270 37
2 bbb-bbb 270 41
6 aaa-aaa 380 39
# get the absolute successive difference in "num"
(df.sort_values(by=['amount', 'num'])
.groupby('amount')
['num'].diff().abs()
)
0 NaN
3 7.0
1 20.0
5 3.0
4 NaN
2 4.0
6 NaN
Name: num, dtype: float64
# check if it's greater than 10 and cumsum
# to create a grouper for groupby
[...].gt(10).cumsum()
0 0
3 0
1 1
5 1
4 1
2 1
6 1
Name: num, dtype: int64