I have an ordered dataframe that I'm trying to aggregate by some grouping columns, based on accumulated values from previous rows of other columns.
df = pd.DataFrame({'ID':['ID1','ID1','ID1','ID1','ID1','ID2','ID2','ID2','ID2']
, 'Group':['Group1','Group2','Group2','Group2','Group1','Group2','Group2','Group2','Group1']
, 'Value1':[0,1,1,1,1,1,0,0,0]
, 'Value2':[1,2,3,4,5,4,3,2,2]})
df
ID Group Value1 Value2
0 ID1 Group1 0 1
1 ID1 Group2 1 2
2 ID1 Group2 1 3
3 ID1 Group2 1 4
4 ID1 Group1 1 5
5 ID2 Group2 1 4
6 ID2 Group2 0 3
7 ID2 Group2 0 2
8 ID2 Group1 0 2
I'd like to aggregate three different ways using Value1 and Value2, grouped by ID and Group. df is already ordered (by date, ID and Group).
Output1: count the number of 1s in previous rows of Value1, by ID and Group (excluding the row itself)
Output2: sum the value of previous rows of Value2, by ID and Group (including the row itself)
Output3: sum Value2 of previous rows, by ID and Group, if Value1 of those previous rows is 1 (excluding the row itself)
here's my desired output:
ID Group Value1 Value2 Output1 Output2 Output3
0 ID1 Group1 0 1 0 1 NaN
1 ID1 Group2 1 2 0 2 NaN
2 ID1 Group2 1 3 1 5 2
3 ID1 Group2 1 4 2 9 5
4 ID1 Group1 1 5 0 6 NaN
5 ID2 Group2 1 4 0 4 NaN
6 ID2 Group2 0 3 1 7 4
7 ID2 Group2 0 2 1 9 4
8 ID2 Group1 0 2 0 2 NaN
To make sure it's clear what I'm trying to do, let's look at the output at index 3 (the fourth row):
3 ID1 Group2 1 4 2 9 5
Output1 = 2 because there are two rows above it in ID1/Group2 that have Value1 = 1.
Output2 = 9 because the sum of Value2 over all rows above it in ID1/Group2, including the row itself, is 2+3+4 = 9.
Output3 = 5 because there are two previous rows in ID1/Group2 that have Value1 = 1, so the sum of their Value2 is 2 + 3 = 5.
I'd like to add that I'm working on a large dataset, so I'm looking for an efficient, high-performance solution.
groupby + cumsum
You can zero out the values in Value2 where the corresponding value in column Value1 is 0 (multiply Value2 by Value1); after that, group the masked column and use cumsum to calculate the cumulative sum. Then, to exclude the current row itself, subtract the masked column from the cumulative sum:
g = df.groupby(['ID', 'Group'])
df['Output1'] = g['Value1'].cumsum() - df['Value1']
df['Output2'] = g['Value2'].cumsum()
s = df['Value2'].mul(df['Value1'])
df['Output3'] = s.groupby([df['ID'], df['Group']]).cumsum() - s
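Putting the approach above together as a runnable sketch (note that this version yields 0 rather than NaN in Output3 for the first row of each group, which is what the update below addresses):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['ID1','ID1','ID1','ID1','ID1','ID2','ID2','ID2','ID2'],
                   'Group': ['Group1','Group2','Group2','Group2','Group1','Group2','Group2','Group2','Group1'],
                   'Value1': [0,1,1,1,1,1,0,0,0],
                   'Value2': [1,2,3,4,5,4,3,2,2]})

g = df.groupby(['ID', 'Group'])
df['Output1'] = g['Value1'].cumsum() - df['Value1']  # count of previous 1s, excluding the current row
df['Output2'] = g['Value2'].cumsum()                 # running sum, including the current row
s = df['Value2'].mul(df['Value1'])                   # Value2 where Value1 == 1, else 0
df['Output3'] = s.groupby([df['ID'], df['Group']]).cumsum() - s

print(df)
```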
Update as per the new requirements in the comments:
def transform(g):
    g['Output1'] = g['Value1'].cumsum() - g['Value1']
    g['Output2'] = g['Value2'].cumsum()
    cond = g['Value1'].eq(1)
    g['Output3'] = g['Value2'].mask(~cond).cumsum().shift().ffill()
    return g
df = df.groupby(['ID', 'Group'], group_keys=False).apply(transform)
Result
print(df)
ID Group Value1 Value2 Output1 Output2 Output3
0 ID1 Group1 0 1 0 1 NaN
1 ID1 Group2 1 2 0 2 NaN
2 ID1 Group2 1 3 1 5 2.0
3 ID1 Group2 1 4 2 9 5.0
4 ID1 Group1 1 5 0 6 NaN
5 ID2 Group2 1 4 0 4 NaN
6 ID2 Group2 0 3 1 7 4.0
7 ID2 Group2 0 2 1 9 4.0
8 ID2 Group1 0 2 0 2 NaN
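Since performance matters on a large dataset, the same logic as transform can also be sketched with purely vectorized groupwise operations, avoiding a Python-level apply: groupwise cumsum, shift and ffill reproduce the mask/cumsum/shift/ffill steps inside the function.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['ID1','ID1','ID1','ID1','ID1','ID2','ID2','ID2','ID2'],
                   'Group': ['Group1','Group2','Group2','Group2','Group1','Group2','Group2','Group2','Group1'],
                   'Value1': [0,1,1,1,1,1,0,0,0],
                   'Value2': [1,2,3,4,5,4,3,2,2]})

keys = [df['ID'], df['Group']]
df['Output1'] = df.groupby(['ID', 'Group'])['Value1'].cumsum() - df['Value1']
df['Output2'] = df.groupby(['ID', 'Group'])['Value2'].cumsum()

s = df['Value2'].where(df['Value1'].eq(1))  # NaN where Value1 == 0
c = s.groupby(keys).cumsum()                # running sum over qualifying rows (NaN rows stay NaN)
df['Output3'] = c.groupby(keys).shift().groupby(keys).ffill()

print(df)
```

This produces the same result as the transform version, including NaN in Output3 for the first row of each group.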