Search code examples
pythonnumpyperformanceoptimizationaugmented-assignment

Why there is no speed benefit of in-place multiplication when returning a numpy array?


I have defined two functions as a minimal working example.

In [2]: A = np.random.random(10_000_000)

In [3]: def f():
   ...:     return A.copy() * np.pi
   ...: 

In [4]: def g():
   ...:     B = A.copy()
   ...:     B *= np.pi
   ...:     return B

Both of them return the same result:

In [5]: assert all(f() == g())

but I would expect g() to be faster, as augmented assignment is (for A) more than 4 times as fast as multiplication:

In [7]: %timeit B = A.copy(); B * np.pi
82.2 ms ± 301 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit B = A.copy(); B *= np.pi
55 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: %timeit B = A.copy()
46.3 ms ± 664 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Sadly, there is no speedup:

In [10]: %timeit f()
54.5 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [11]: %timeit g()
54.6 ms ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Of course dis.dis(g) shows some overhead when compared to dis.dis(f) (2 * STORE_FAST + 2 * LOAD_FAST):

In [26]: dis.dis(f)
  2           0 LOAD_GLOBAL              0 (A)
              2 LOAD_METHOD              1 (copy)
              4 CALL_METHOD              0
              6 LOAD_GLOBAL              2 (np)
              8 LOAD_ATTR                3 (pi)
             10 BINARY_MULTIPLY
             12 RETURN_VALUE

In [27]: dis.dis(g)
  2           0 LOAD_GLOBAL              0 (A)
              2 LOAD_METHOD              1 (copy)
              4 CALL_METHOD              0
              6 STORE_FAST               0 (B)

  3           8 LOAD_FAST                0 (B)
             10 LOAD_GLOBAL              2 (np)
             12 LOAD_ATTR                3 (pi)
             14 INPLACE_MULTIPLY
             16 STORE_FAST               0 (B)

  4          18 LOAD_FAST                0 (B)
             20 RETURN_VALUE

but for A = np.random.random(1) the overhead (difference in execution time) is less than 2 µs.

To make things even more confusing I defined a third function h() which behaves as expected (is slower than f()):

In [19]: def h():
    ...:     B = A.copy()
    ...:     return B * np.pi
    ...: 

In [20]: %timeit h()
81.9 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

but dis.dis(h) gives me no insight why:

In [28]: dis.dis(h)
  2           0 LOAD_GLOBAL              0 (A)
              2 LOAD_METHOD              1 (copy)
              4 CALL_METHOD              0
              6 STORE_FAST               0 (B)

  3           8 LOAD_FAST                0 (B)
             10 LOAD_GLOBAL              2 (np)
             12 LOAD_ATTR                3 (pi)
             14 BINARY_MULTIPLY
             16 RETURN_VALUE

Why there is no speed benefit of in-place multiplication when returning a numpy array, or maybe why f() gets the speed benefit despite of binary multiplication?

I use Python 3.7.12 and numpy 1.21.6.


Solution

  • Your f is benefiting from a temporary elision optimization introduced in NumPy 1.13.

    When NumPy can tell that one of the operands of an arithmetic operator has no other references, it may reuse that operand's memory for the result array. That's the case in your f function - the A.copy() array has no other references.

    However, detecting this situation is very expensive and not always possible. Checking that the refcount is 1 is easy, but NumPy has to inspect the C-level call stack to make sure that the operation is being invoked by the bytecode evaluation loop (which will discard the operand references) instead of an extension module (which might hold onto those references).

    On platforms without the backtrace function, NumPy cannot perform this optimization. Even on platforms with backtrace, the cost of the stack inspection means that NumPy only tries the optimization for arrays of size at least 256 KiB.

    You can see the implementation in numpy/core/src/multiarray/temp_elide.c.