python pandas dataframe numpy vectorization

Vectorize a folding process in a dataframe

Suppose we have a sample dataframe like the one below:

df = pd.DataFrame({'A': [np.nan, 0.5, 0.5, 0.5, 0.5], 
                   'B': [np.nan, 3, 4, 1, 2], 
                   'C': [10, np.nan, np.nan, np.nan, np.nan]})
>>> df
     A    B     C
0  NaN  NaN  10.0
1  0.5  3.0   NaN
2  0.5  4.0   NaN
3  0.5  1.0   NaN
4  0.5  2.0   NaN

Col 'D' is calculated with the following operation:

>>> df
     A    B     C     D
0  NaN  NaN  10.0  10.0 
1  0.5  3.0   NaN   8.0 = (10 x 0.5) + 3
2  0.5  4.0   NaN   8.0 = (8 x 0.5) + 4 
3  0.5  1.0   NaN   5.0 = (8 x 0.5) + 1 
4  0.5  2.0   NaN   4.5 = (5 x 0.5) + 2

Calculating col 'D' reflects a folding process that recalls the previous row of col 'C' and current row of col 'A' and 'B' in each row operation.

I've tried using for loops, functools.reduce() and iterators to do this, but I want to know if there's another method that uses vectorization as much as possible in order to make this operation more efficient in a larger dataset.

Solution

I'm not aware of pure vectorized pandas/numpy solution, but you can try to use numba to speed up the computation:

from numba import njit


@njit
def calculate(A, B, starting_value=10):
    out = np.empty_like(A, dtype=np.float64)

    out[0] = starting_value

    for i, (a, b) in enumerate(zip(A[1:], B[1:]), 1):
        out[i] = (out[i - 1] * a) + b

    return out


df["D"] = calculate(df["A"].values, df["B"].values, 10)
print(df)

Prints:

     A    B     C     D
0  NaN  NaN  10.0  10.0
1  0.5  3.0   NaN   8.0
2  0.5  4.0   NaN   8.0
3  0.5  1.0   NaN   5.0
4  0.5  2.0   NaN   4.5