Should we pre-calculate scalar calculations before we apply them to dataframe columns?


I'm curious whether option (b) is more efficient than option (a). At first glance, option (a) performs several times more operations than option (b), but when I ran some simulations on a df with a million rows, option (b) was only a fraction faster on average. Does that mean pandas automatically groups all the scalar operations in option (a)?

(a) The variables a, b, c, d, e, and f are all scalars.

    df['val2'] = (a*b+c*d)*df['val1']*e/f

(b)

    x = (a*b+c*d)*e/f
    df['val2'] = df['val1']*x

Solution

  • Yes, it is better to pre-compute x. pandas does not regroup the scalars for you: what matters is operator precedence and the order in which Python evaluates the operations (* and / have equal precedence and are applied left to right).

    Assuming s is your Series, when you run (a*b+c*d)*s*e/f you perform two multiplications and one division on the full Series. If you pre-compute x, or write (a*b+c*d)*e/f*s, then only one operation involves the Series.

    Example:

    %timeit x*s
    1.19 ms ± 73.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    
    %timeit (a*b+c*d)*s*e/f
    3.45 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %timeit s*(a*b+c*d)*e/f
    3.63 ms ± 84.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    # now let's force the scalar operation to be grouped
    %timeit s*((a*b+c*d)*e/f)
    1.21 ms ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    
    %timeit (a*b+c*d)*e/f*s
    1.14 ms ± 80.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    

    Setup:

    import numpy as np
    import pandas as pd

    s = pd.Series(np.arange(1_000_000))
    a = b = c = d = e = f = 2
    x = (a*b+c*d)*e/f  # scalar, computed once
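
The %timeit lines above only work in IPython or Jupyter. As a rough sketch, the same comparison can be run as a plain script with the standard timeit module. The names variants, same and timings_ms are made up for this illustration, and the numbers will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000))
a = b = c = d = e = f = 2

# Wrap each expression in a callable so timeit can run it outside IPython.
variants = {
    "(a*b+c*d)*s*e/f": lambda: (a * b + c * d) * s * e / f,      # three Series operations
    "s*((a*b+c*d)*e/f)": lambda: s * ((a * b + c * d) * e / f),  # one Series operation
}

# Regrouping should give the same values; allclose tolerates float rounding.
results = {name: fn() for name, fn in variants.items()}
same = np.allclose(results["(a*b+c*d)*s*e/f"], results["s*((a*b+c*d)*e/f)"])
print("same values:", same)

# Best of several repeats, reported per call in milliseconds.
number = 10
timings_ms = {
    name: min(timeit.repeat(fn, number=number, repeat=5)) / number * 1e3
    for name, fn in variants.items()
}
for name, ms in timings_ms.items():
    print(f"{name:>20}: {ms:.2f} ms")
```

The allclose check is there because regrouping floating-point arithmetic can change the last few bits of a result, so an exact equality test is not a safe way to confirm that the two forms agree in general.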
    

    In the initial (a*b+c*d)*df['val1']*e/f, the order of the operations is:

    a*b         # ab      #
    c*d         # cd      # scalars
    ab + cd     # abcd    #
    abcd * s    # sabcd   #
    sabcd * e   # esabcd  # Series
    esabcd / f  #         #
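
To make that order visible, here is a minimal sketch that counts how many full passes over the data each form triggers. CountingSeries and count_ops are hypothetical helpers written for this illustration and are not part of pandas. The class only mimics the relevant behaviour: every multiplication or division involving it costs one pass over all elements.

```python
class CountingSeries:
    """Hypothetical stand-in for a Series that counts element-wise passes."""

    ops = 0  # counter shared by all instances

    def __init__(self, values):
        self.values = list(values)

    def _apply(self, func):
        CountingSeries.ops += 1  # one full pass over the data
        return CountingSeries(func(v) for v in self.values)

    def __mul__(self, k):
        return self._apply(lambda v: v * k)

    __rmul__ = __mul__  # scalar * series also costs one pass

    def __truediv__(self, k):
        return self._apply(lambda v: v / k)


def count_ops(expr):
    """Evaluate expr on a fresh CountingSeries and return the number of passes."""
    CountingSeries.ops = 0
    expr(CountingSeries(range(5)))
    return CountingSeries.ops


a = b = c = d = e = f = 2

original = count_ops(lambda s: (a * b + c * d) * s * e / f)
grouped = count_ops(lambda s: s * ((a * b + c * d) * e / f))
print(original, grouped)  # prints: 3 1
```

The original expression touches the data three times because Python evaluates * and / from left to right, so every operation after the first one that involves s works on a full-length intermediate result. With the parentheses, the scalars collapse to a single number before s is touched.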