python pandas dataframe optimization

Best way to avoid a loop


I have two dataframes of numbers, x and y, of the same length, and an input number a. I would like to find the fastest way to calculate a third list z such that:

  • z[0] = a
  • z[i] = z[i-1]*(1+x[i]) + y[i]

without using a loop like this:

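import numpy as np
import pandas as pd

a = 213
x = pd.DataFrame({'RandomNumber': np.random.rand(200)})
y = pd.DataFrame({'RandomNumber': np.random.rand(200)})
z = pd.Series(index=x.index, dtype=float)
z[0] = a
for i in range(1, len(x.index)):
    # select the column so that .iloc[i] yields a scalar rather than a one-element row
    z[i] = z[i-1] * (1 + x['RandomNumber'].iloc[i]) + y['RandomNumber'].iloc[i]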

Solution

  • You can't really vectorize this function: each z[i] depends on z[i-1], and expanding the recurrence quickly becomes too complex.

    For instance, z[i+1] expressed as a function of z[i-1] is equal to:

    z[i-1]*(1+x[i]) + y[i] + z[i-1]*(x[i+1] + x[i]*x[i+1]) + y[i]*x[i+1] + y[i+1]


    And this gets worse with each step.
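
As a quick sanity check of how the recurrence unrolls, the sketch below applies it twice and compares the result with z[i+1] written out term by term in terms of z[i-1]. The sample values and variable names are hypothetical and serve only to check the algebra.

```python
import math
import random

random.seed(0)

# Hypothetical sample values, used only to check the algebra.
z_prev = 213.0                                   # stands in for z[i-1]
x_i, x_next = random.random(), random.random()   # x[i], x[i+1]
y_i, y_next = random.random(), random.random()   # y[i], y[i+1]

# Apply the recurrence z[k] = z[k-1]*(1 + x[k]) + y[k] twice.
z_i = z_prev * (1 + x_i) + y_i
z_next = z_i * (1 + x_next) + y_next

# The same quantity, expanded term by term in terms of z[i-1].
expanded = (z_prev * (1 + x_i) + y_i
            + z_prev * (x_next + x_i * x_next)
            + y_i * x_next + y_next)

# Both routes agree up to floating-point rounding.
assert math.isclose(z_next, expanded)
```

Each further step multiplies every existing term by another (1 + x) factor and adds one more y term, which is why the written-out form grows so quickly.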

    As suggested in the comments, if speed is a concern, you could use numba:

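    from numba import jit

    @jit(nopython=True)  # compile the loop in nopython mode
    def f(a, x, y):
        # x and y are 1-D NumPy arrays; x[0] and y[0] are unused because z[0] = a
        out = [a]
        for i in range(1, len(x)):
            # each value depends on the previous one
            out.append(out[i-1] * (1 + x[i]) + y[i])
        return out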
    
    out = pd.Series(f(213, x['RandomNumber'].to_numpy(), y['RandomNumber'].to_numpy()), index=x.index)
    

    Output (using np.random.seed(0) and with 5 rows):

    0     213.000000
    1     365.772922
    2     587.139217
    3     908.025165
    4    1293.097825
    dtype: float64
    

    Timings (200 rows):

    # python loop
    59.8 ms ± 908 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    # numba
    151 µs ± 968 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
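
A possible variation on the numba function above, sketched under the assumption that numba may or may not be installed: the result is written into a preallocated NumPy array rather than appended to a list, and the plain Python function is used as a fallback when the import fails. The names recurrence and recurrence_fast are illustrative and not part of the original answer, and the timings above do not apply to this variant.

```python
import numpy as np


def recurrence(a, x, y):
    """Compute z[0] = a and z[i] = z[i-1] * (1 + x[i]) + y[i] for i >= 1."""
    out = np.empty(len(x), dtype=np.float64)  # preallocate instead of appending
    out[0] = a
    for i in range(1, len(x)):
        out[i] = out[i - 1] * (1.0 + x[i]) + y[i]
    return out


try:
    # Assumption: numba is optional; compile the same function if it is available.
    from numba import njit
    recurrence_fast = njit(recurrence)
except ImportError:
    # Fall back to the uncompiled function.
    recurrence_fast = recurrence

# Same setup as the 5-row output shown above: seed 0, x drawn first, then y.
np.random.seed(0)
x = np.random.rand(5)
y = np.random.rand(5)

z = recurrence_fast(213, x, y)
print(z)
```

With a DataFrame, the arrays could be taken from the column as in the answer, for example x['RandomNumber'].to_numpy(), and the result wrapped in pd.Series(z, index=x.index).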