Search code examples
pythonpandasdataframenumpyvectorization

Vectorize a folding process in a dataframe


Suppose we have a sample dataframe like the one below:

df = pd.DataFrame({'A': [np.nan, 0.5, 0.5, 0.5, 0.5], 
                   'B': [np.nan, 3, 4, 1, 2], 
                   'C': [10, np.nan, np.nan, np.nan, np.nan]})
>>> df
     A    B     C
0  NaN  NaN  10.0
1  0.5  3.0   NaN
2  0.5  4.0   NaN
3  0.5  1.0   NaN
4  0.5  2.0   NaN

Col 'D' is calculated with the following operation:

>>> df
     A    B     C     D
0  NaN  NaN  10.0  10.0 
1  0.5  3.0   NaN   8.0 = (10 x 0.5) + 3
2  0.5  4.0   NaN   8.0 = (8 x 0.5) + 4 
3  0.5  1.0   NaN   5.0 = (8 x 0.5) + 1 
4  0.5  2.0   NaN   4.5 = (5 x 0.5) + 2 

Calculating col 'D' reflects a folding process that recalls the previous row of col 'C' and current row of col 'A' and 'B' in each row operation.

I've tried using for loops, functools.reduce() and iterators to do this, but I want to know if there's another method that uses vectorization as much as possible in order to make this operation more efficient in a larger dataset.


Solution

  • I'm not aware of pure vectorized pandas/numpy solution, but you can try to use to speed up the computation:

    from numba import njit
    
    
    @njit
    def calculate(A, B, starting_value=10):
        out = np.empty_like(A, dtype=np.float64)
    
        out[0] = starting_value
    
        for i, (a, b) in enumerate(zip(A[1:], B[1:]), 1):
            out[i] = (out[i - 1] * a) + b
    
        return out
    
    
    df["D"] = calculate(df["A"].values, df["B"].values, 10)
    print(df)
    

    Prints:

         A    B     C     D
    0  NaN  NaN  10.0  10.0
    1  0.5  3.0   NaN   8.0
    2  0.5  4.0   NaN   8.0
    3  0.5  1.0   NaN   5.0
    4  0.5  2.0   NaN   4.5