python pandas dataframe optimization vectorization

How to Fill Pandas Column with calculations involving cells from previous rows without Using for loop

The cells highlighted in Green are the calculations that need to be done in pandas dataframe without the application for for loop.

i tried the below code. But it is giving wrong values in column "Val".

The Obtained output from the code is

DataFrame: df Par Val 0 50 50.0 1 60 57.0 2 70 67.0 3 80 77.0 4 90 87.0 5 100 97.0

Need help in getting the values as shown in image without loop application. [![enter image description here][2]][2]

[![enter image description here][2]][2]

df = pd.DataFrame()
df['Par'] = [50,60,70,80,90,100]
k = df['Par'].shift(1).fillna(df['Par'][0]).astype(float)

df['Val'] = (df['Par']*0.7) + (k*0.3)

print("DataFrame: df\n",df)`

Solution

Recursive calculations are not vectorisable, for improve performance is used numba:

from numba import jit

@jit(nopython=True)
def f(a):
    d = np.empty(a.shape)
    d[0] = a[0]
    for i in range(1, a.shape[0]):
        d[i] = d[i-1] * 0.3 + a[i] * 0.7
    return d

df['Val'] = f(df['Par'].to_numpy())
print (df)
   Par      Val
0   50  50.0000
1   60  57.0000
2   70  66.1000
3   80  75.8300
4   90  85.7490
5  100  95.7247

Difference for performance for 1k rows:

from numba import jit
import itertools

np.random.seed(2022)

df = pd.DataFrame({'Par': np.random.randint(100, size=1000)})


In [64]: %%timeit
    ...: 
    ...: df['Val1'] = f(df['Par'].to_numpy())
    ...: 
    ...: import itertools
    ...: 
    ...: df.loc[0,"Val"] = df.loc[0,"Par"]
    ...: for _ in itertools.repeat(None, len(df)):
    ...:     df["Val"] = df["Val"].fillna((df["Par"]*0.7)+(df["Val"].shift(1)*(0.3)))
    ...:     
1.05 s ± 193 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [65]: %%timeit
    ...: @jit(nopython=True)
    ...: def f(a):
    ...:     d = np.empty(a.shape)
    ...:     d[0] = a[0]
    ...:     for i in range(1, a.shape[0]):
    ...:         d[i] = d[i-1] * 0.3 + a[i] * 0.7
    ...:     return d
    ...: 
    ...: df['Val1'] = f(df['Par'].to_numpy())
    ...: 
121 ms ± 3.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Test for 100krows:

np.random.seed(2022)
df = pd.DataFrame({'Par': np.random.randint(100, size=100000)})


In [70]: %%timeit
    ...: 
    ...: df['Val1'] = f(df['Par'].to_numpy())
    ...: 
    ...: import itertools
    ...: 
    ...: df.loc[0,"Val"] = df.loc[0,"Par"]
    ...: for _ in itertools.repeat(None, len(df)):
    ...:     df["Val"] = df["Val"].fillna((df["Par"]*0.7)+(df["Val"].shift(1)*(0.3)))
    ...:     
4min 47s ± 5.39 s per loop (mean ± std. dev. of 7 runs, 1 loop each)



In [71]: %%timeit
    ...: @jit(nopython=True)
    ...: def f(a):
    ...:     d = np.empty(a.shape)
    ...:     d[0] = a[0]
    ...:     for i in range(1, a.shape[0]):
    ...:         d[i] = d[i-1] * 0.3 + a[i] * 0.7
    ...:     return d
    ...: 
    ...: df['Val1'] = f(df['Par'].to_numpy())
    ...: 
    ...: 
129 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)