Search code examples
pythondataframenumpy

Python recalculate cumulative values of each row of multidimensional data set without looping?


I am working on complex calculations of a dataset that requires recalculating based on the sum of the previous rows. I'll try to articulate as best as I can.

I have an n x n dataset, I need to create a new n x n dataset that takes the value from the first record, subtracts the cumulative sum of that row up to that value, and then multiplies by a value from another 1 x n series.

For example: n x n Dataset :

i 0 1 2 3 4 5 6
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
2 200 0 0 0 0 0 0
3 200 150 0 0 0 0 0
4 200 150 125 0 0 0 0
5 200 150 125 100 0 0 0
6 200 150 125 100 75 0 0

1 x n Series :

i 0
0 .000
1 .001
2 .002
3 .003
4 .004
5 .005
6 .006

Result:

i 0 1 2 3 4 5
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 .4 [(200 - sum 0) * .002] 0 0 0 0 0
3 .5988 [(200 - sum .4) * .003] .45 [(150 - sum 0) * .003] 0 0 0 0
4 .7960 [(200 - sum (.4, .5988)) * .004] .5982 [(150 - sum .45) * .004] .50 0 0 0
5 .9910 [(200 - sum (.4, .5988, .7960)) * .005] .7447 [(150 - sum .45, .5982) * .005] .6225 .50 0 0
6 1.1832 [(200 - sum (.4, .5988, .7960, .9910)) * .006] .8892 [(150 - sum .45, .5982, .7447) * .006] .74326 .5970 .450 0

I am good at using pandas or numpy to get this data. I am currently looping over each row and column and it is taking a really long time to calculate.

Any help is greatly appreciated.

This is code that works using looping.

import numpy as np
import pandas as pd


df_a =pd.DataFrame([[0,0,0,0,0,0,0],[0,0,0,0,0,0,0],[200,0,0,0,0,0,0],[200,150,0,0,0,0,0],[200,150,125,0,0,0,0],[200,150,125,100,0,0,0],[200,150,125,100,75,0,0]]).astype(float)

df_series = pd.Series([0.0,.001, .002, .003, .004, .005, .006])
df_b = pd.DataFrame(np.arange(49).reshape(7,7)).astype(float)

for i in range(df_a.shape[0]): #iterate over rows
    for j in range(df_a.shape[1]): #iterate over columns
        df_a_val = df_a.iloc[i, j] #source dataset value
        ser_val = df_series.iloc[i] #value of series at row
        df_b_prev = df_b.iloc[:i,j] #get series of previous values
        df_b_sum = df_b_prev.sum() #sum previous values
        
        
        df_b.at[i, j] = (df_a_val-df_b_sum)*(ser_val) #cell based calculation and set value
        print("(df_a_val: {} - df_b_sum: {}) * ser_val : {} = {}".format(df_a_val, df_b_sum, ser_val, df_b.at[i, j]))
print(df_series)
print(df_b)

Solution

  • You could use shift after broadcasted multiplication

    a = (df_a > 0) * df_series.to_numpy()[:,None]
    df_a * a * (1 - a).cumprod(0).shift(fill_value=0)
    
              0         1         2      3     4    5    6
    0  0.000000  0.000000  0.000000  0.000  0.00  0.0  0.0
    1  0.000000  0.000000  0.000000  0.000  0.00  0.0  0.0
    2  0.400000  0.000000  0.000000  0.000  0.00  0.0  0.0
    3  0.598800  0.450000  0.000000  0.000  0.00  0.0  0.0
    4  0.796005  0.598200  0.500000  0.000  0.00  0.0  0.0
    5  0.991026  0.744759  0.622500  0.500  0.00  0.0  0.0
    6  1.183285  0.889242  0.743265  0.597  0.45  0.0  0.0