Suppose we have a sample dataframe like the one below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 0.5, 0.5, 0.5, 0.5],
                   'B': [np.nan, 3, 4, 1, 2],
                   'C': [10, np.nan, np.nan, np.nan, np.nan]})
>>> df
A B C
0 NaN NaN 10.0
1 0.5 3.0 NaN
2 0.5 4.0 NaN
3 0.5 1.0 NaN
4 0.5 2.0 NaN
Col 'D' is calculated with the following operation:
>>> df
A B C D
0 NaN NaN 10.0 10.0
1 0.5 3.0 NaN 8.0 = (10 x 0.5) + 3
2 0.5 4.0 NaN 8.0 = (8 x 0.5) + 4
3 0.5 1.0 NaN 5.0 = (8 x 0.5) + 1
4 0.5 2.0 NaN 4.5 = (5 x 0.5) + 2
Calculating col 'D' is a folding process: each row combines the previous row's value of 'D' (seeded from col 'C' in the first row) with the current row's values of 'A' and 'B'.
I've tried using for loops, functools.reduce() and iterators to do this (a plain-loop version is sketched below for reference), but I want to know if there's another method that uses vectorization as much as possible, to make this operation more efficient on a larger dataset.
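For reference, the plain-loop version looks roughly like this (assuming the seed is the first value of col 'C'):

d = [df.loc[0, "C"]]                        # seed with the first value of col 'C'
for a, b in zip(df["A"][1:], df["B"][1:]):
    d.append(d[-1] * a + b)                 # D[i] = D[i-1] * A[i] + B[i]
df["D"] = d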
I'm not aware of a pure vectorized pandas/numpy solution, but you can try numba to speed up the computation:
import numpy as np
from numba import njit

@njit
def calculate(A, B, starting_value=10):
    # out[i] = out[i - 1] * A[i] + B[i], seeded with starting_value in row 0
    out = np.empty_like(A, dtype=np.float64)
    out[0] = starting_value
    for i, (a, b) in enumerate(zip(A[1:], B[1:]), 1):
        out[i] = (out[i - 1] * a) + b
    return out

df["D"] = calculate(df["A"].values, df["B"].values, 10)
print(df)
Prints:
A B C D
0 NaN NaN 10.0 10.0
1 0.5 3.0 NaN 8.0
2 0.5 4.0 NaN 8.0
3 0.5 1.0 NaN 5.0
4 0.5 2.0 NaN 4.5
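If you want to check the benefit on a larger dataset, a rough timing harness like the one below can be used (the array size and random values are just placeholders; the warm-up call is made separately because the first invocation includes numba's compilation overhead):

import time

n = 1_000_000
big = pd.DataFrame({"A": np.random.rand(n), "B": np.random.rand(n)})

calculate(big["A"].values, big["B"].values, 10)  # warm-up call triggers JIT compilation

start = time.perf_counter()
big["D"] = calculate(big["A"].values, big["B"].values, 10)
print(f"numba: {time.perf_counter() - start:.4f}s")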