Search code examples
pythonpandasperformancefor-loopapply

Python - Efficient calculation where end value of one row is the start value of another row


I would like to make simple calculations on a rolling basis, but have heavy performance issues when I try to solve this with a nested for-loop. I need to perform this kind of operations on very large data, but have to use standard Python (incl. Pandas). The values are floats and can be negative, zero or positive.

I have a pd.DataFrame (df1) which contains (structured by some dimensions, lets call them key1 and key2) a start column, a end column and some operations-columns in between, which are supposed to be used to calculate the end column based on the start column.

Basically, the simple logic is: start + plus - minus = end, where the end value of each row is the start value of the next row.

This would need to be done by the two keys, i.e. for AX, AY and BX seperately.

df2 shows the desired result, but I don't know how to get there in an efficient way without blowing up my memory if this task is done on much larger tables.

import pandas as pd 
import numpy as np

df1 = pd.DataFrame(np.array([["A", "X", 3,6,4,0], ["A", "X", 0,2,10,0], ["A", "X", 0,9,3,0], ["A", "Y", 8,3,1,0], ["A", "Y", 0,2,3,0], ["B", "X", 4,4,2,0], ["B", "X", 0,1,0,0]]),
                   columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])

>>> df1
  key1 key2 start plus minus end
0    A    X     3    6     4   0
1    A    X     0    2    10   0
2    A    X     0    9     3   0
3    A    Y     8    3     1   0
4    A    Y     0    2     3   0
5    B    X     4    4     2   0
6    B    X     0    1     0   0
    

df2 = pd.DataFrame(np.array([["A", "X", 3,6,4,5], ["A", "X", 5,2,10,-3], ["A", "X", -3,9,3,3], ["A", "Y", 8,3,1,10], ["A", "Y", 10,2,3,9], ["B", "X", 4,4,2,2], ["B", "X", 2,1,0,3]]),
                   columns=['key1', 'key2', 'start', 'plus', 'minus', 'end'])

>>> df2
  key1 key2 start plus minus end
0    A    X     3    6     4   5
1    A    X     5    2    10  -3
2    A    X    -3    9     3   3
3    A    Y     8    3     1  10
4    A    Y    10    2     3   9
5    B    X     4    4     2   2
6    B    X     2    1     0   3

Solution

  • You can create helper Series for subtract plus and minus columns, create cumulative sums per groups by both columns and add first value of start for final end column, then for start column use DataFrameGroupBy.shift with replace first value by original values in Series.fillna:

    plusminus = df1['plus'].sub(df1['minus'])
    df1 = df1.assign(plusminus = plusminus)
    g = df1.groupby(['key1','key2'])
    
    df1['end'] = g['plusminus'].cumsum().add(g['start'].transform('first'))
    df1['start'] = g['end'].shift().fillna(df1['start']).astype(int)
    df1 = df1.drop('plusminus', axis=1)
    print (df1)
      key1 key2  start  plus  minus  end
    0    A    X      3     6      4    5
    1    A    X      5     2      1    6
    2    A    X      6     5      7    4
    3    A    Y      8     3      1   10
    4    A    Y     10     2      3    9
    5    B    X      4     4      2    6
    6    B    X      6     1      0    7