I would like to make simple calculations on a rolling basis, but have heavy performance issues when I try to solve this with a nested for-loop. I need to perform this kind of operations on very large data, but have to use standard Python (incl. Pandas). The values are floats and can be negative, zero or positive.
I have a pd.DataFrame (df1) which contains (structured by some dimensions, lets call them key1 and key2) a start column, a end column and some operations-columns in between, which are supposed to be used to calculate the end column based on the start column.
Basically, the simple logic is: start + plus - minus = end, where the end value of each row is the start value of the next row.
This would need to be done by the two keys, i.e. for AX, AY and BX seperately.
df2 shows the desired result, but I don't know how to get there in an efficient way without blowing up my memory if this task is done on much larger tables.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([["A", "X", 3,6,4,0], ["A", "X", 0,2,10,0], ["A", "X", 0,9,3,0], ["A", "Y", 8,3,1,0], ["A", "Y", 0,2,3,0], ["B", "X", 4,4,2,0], ["B", "X", 0,1,0,0]]),
columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])
>>> df1
key1 key2 start plus minus end
0 A X 3 6 4 0
1 A X 0 2 10 0
2 A X 0 9 3 0
3 A Y 8 3 1 0
4 A Y 0 2 3 0
5 B X 4 4 2 0
6 B X 0 1 0 0
df2 = pd.DataFrame(np.array([["A", "X", 3,6,4,5], ["A", "X", 5,2,10,-3], ["A", "X", -3,9,3,3], ["A", "Y", 8,3,1,10], ["A", "Y", 10,2,3,9], ["B", "X", 4,4,2,2], ["B", "X", 2,1,0,3]]),
columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])
>>> df2
key1 key2 start plus minus end
0 A X 3 6 4 5
1 A X 5 2 10 -3
2 A X -3 9 3 3
3 A Y 8 3 1 10
4 A Y 10 2 3 9
5 B X 4 4 2 2
6 B X 2 1 0 3
You can create helper Series for subtract plus
and minus
columns, create cumulative sums per groups by both columns and add first value of start
for final end
column, then for start
column use DataFrameGroupBy.shift
with replace first value by original values in Series.fillna
:
plusminus = df1['plus'].sub(df1['minus'])
df1 = df1.assign(plusminus = plusminus)
g = df1.groupby(['key1','key2'])
df1['end'] = g['plusminus'].cumsum().add(g['start'].transform('first'))
df1['start'] = g['end'].shift().fillna(df1['start']).astype(int)
df1 = df1.drop('plusminus', axis=1)
print (df1)
key1 key2 start plus minus end
0 A X 3 6 4 5
1 A X 5 2 1 6
2 A X 6 5 7 4
3 A Y 8 3 1 10
4 A Y 10 2 3 9
5 B X 4 4 2 6
6 B X 6 1 0 7