ufunc memory consumption in arithmetic expressions

What is the memory consumption of arithmetic NumPy expressions, e.g.

vec ** 3 + vec ** 2 + vec

(vec being a numpy.ndarray)? Is a new array allocated for each intermediate operation? Could such a compound expression use several times the memory of the underlying ndarray?

Solution

You are correct: NumPy generally allocates a new array for each intermediate result. Fortunately, the numexpr package is designed to deal with this issue. From its description:

The main reason why NumExpr achieves better performance than NumPy is that it avoids allocating memory for intermediate results. This results in better cache utilization and reduces memory access in general. Due to this, NumExpr works best with large arrays.

Example:

In [97]: xs = np.random.rand(1_000_000)

In [98]: %timeit xs ** 3 + xs ** 2 + xs
26.8 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [99]: %timeit numexpr.evaluate('xs ** 3 + xs ** 2 + xs')
1.43 ms ± 20.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
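The timings above measure speed, not memory. To look at the allocation claim directly, here is a minimal sketch that uses the standard-library tracemalloc module, which NumPy reports its array allocations to. The helper name peak_bytes is made up for this illustration, and the exact ratio depends on the NumPy version and platform, so no expected output is shown.

```python
import tracemalloc

import numpy as np


def peak_bytes(func):
    """Return the peak number of traced bytes allocated while func() runs."""
    tracemalloc.start()
    try:
        result = func()  # keep the result alive until the peak is read
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    del result
    return peak


# xs is created before tracing starts, so it is not counted in the peak.
xs = np.random.rand(1_000_000)

peak = peak_bytes(lambda: xs ** 3 + xs ** 2 + xs)
ratio = peak / xs.nbytes

print(f"array size: {xs.nbytes / 1e6:.1f} MB")
print(f"peak during evaluation: {peak / 1e6:.1f} MB ({ratio:.1f}x the array)")
```

Since xs ** 3 and xs ** 2 must both exist before they can be added, the peak should be at least twice the size of xs, on top of xs itself.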

Thanks to @max9111 for pointing out that numexpr simplifies powers to multiplications. Most of the discrepancy in the benchmark seems to be explained by this optimization of xs ** 3, as the following timings show:

In [421]: %timeit xs * xs
1.62 ms ± 12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [422]: %timeit xs ** 2
1.63 ms ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [423]: %timeit xs ** 3
22.8 ms ± 283 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [424]: %timeit xs * xs * xs
2.52 ms ± 58.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
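If numexpr is not an option, both observations (multiplication is cheaper than a general power, and temporaries cost memory) can be combined in plain NumPy. The following is only a sketch under assumptions: the function name poly_horner is hypothetical, it relies on the identity x**3 + x**2 + x = ((x + 1) * x + 1) * x, and the result matches the original expression only up to floating-point rounding.

```python
import numpy as np


def poly_horner(vec):
    """Evaluate vec**3 + vec**2 + vec with a single newly allocated array."""
    out = vec + 1.0  # the only allocation: vec + 1
    out *= vec       # in place: (vec + 1) * vec = vec**2 + vec
    out += 1.0       # in place: vec**2 + vec + 1
    out *= vec       # in place: vec**3 + vec**2 + vec
    return out


xs = np.random.rand(1_000_000)
result = poly_horner(xs)

# Compare against the straightforward expression, allowing for rounding.
print(np.allclose(result, xs ** 3 + xs ** 2 + xs))
```

The in-place operators (*=, +=) write into out instead of creating new arrays, so the input xs is left untouched and only the result array is allocated.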