I have defined two functions as a minimal working example.
In [2]: A = np.random.random(10_000_000)
In [3]: def f():
   ...:     return A.copy() * np.pi
   ...:

In [4]: def g():
   ...:     B = A.copy()
   ...:     B *= np.pi
   ...:     return B
Both of them return the same result:
In [5]: assert all(f() == g())
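As a sanity check that the augmented assignment in g() really mutates the copied buffer in place (rather than rebinding B to a new array), one can compare the data pointer before and after the operation; a minimal sketch, using a much smaller array than the original:

```python
import numpy as np

A = np.random.random(10_000)  # smaller than the original array, enough to illustrate

B = A.copy()
addr_before = B.__array_interface__['data'][0]
B *= np.pi  # the ufunc writes into B's existing buffer
addr_after = B.__array_interface__['data'][0]
print(addr_before == addr_after)  # True: no new allocation

# Plain multiplication, by contrast, produces a fresh array
C = B * np.pi
print(C.__array_interface__['data'][0] == addr_after)  # False: new buffer
```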
but I would expect g() to be faster, as augmented assignment is (for A) more than 4 times as fast as plain multiplication:
In [7]: %timeit B = A.copy(); B * np.pi
82.2 ms ± 301 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit B = A.copy(); B *= np.pi
55 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: %timeit B = A.copy()
46.3 ms ± 664 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
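Subtracting the copy baseline from the two timings above isolates the cost of the arithmetic itself, which is where the 4x factor comes from; a quick back-of-the-envelope check using the measured numbers:

```python
# Timings from the %timeit runs above, in milliseconds
copy_ms = 46.3        # B = A.copy()
mul_ms = 82.2         # B = A.copy(); B * np.pi
inplace_ms = 55.0     # B = A.copy(); B *= np.pi

mul_only = mul_ms - copy_ms          # cost of the plain multiply alone
inplace_only = inplace_ms - copy_ms  # cost of the in-place multiply alone
print(round(mul_only, 1), round(inplace_only, 1), round(mul_only / inplace_only, 1))
# 35.9 8.7 4.1
```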
Sadly, there is no speedup:
In [10]: %timeit f()
54.5 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [11]: %timeit g()
54.6 ms ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Of course dis.dis(g) shows some overhead compared to dis.dis(f) (two extra STORE_FAST and two extra LOAD_FAST instructions):
In [26]: dis.dis(f)
2 0 LOAD_GLOBAL 0 (A)
2 LOAD_METHOD 1 (copy)
4 CALL_METHOD 0
6 LOAD_GLOBAL 2 (np)
8 LOAD_ATTR 3 (pi)
10 BINARY_MULTIPLY
12 RETURN_VALUE
In [27]: dis.dis(g)
2 0 LOAD_GLOBAL 0 (A)
2 LOAD_METHOD 1 (copy)
4 CALL_METHOD 0
6 STORE_FAST 0 (B)
3 8 LOAD_FAST 0 (B)
10 LOAD_GLOBAL 2 (np)
12 LOAD_ATTR 3 (pi)
14 INPLACE_MULTIPLY
16 STORE_FAST 0 (B)
4 18 LOAD_FAST 0 (B)
20 RETURN_VALUE
but for A = np.random.random(1) the overhead (the difference in execution time) is less than 2 µs.
To make things even more confusing, I defined a third function h() which behaves as expected (it is slower than f()):
In [19]: def h():
    ...:     B = A.copy()
    ...:     return B * np.pi
    ...:
In [20]: %timeit h()
81.9 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
but dis.dis(h) gives me no insight into why:
In [28]: dis.dis(h)
2 0 LOAD_GLOBAL 0 (A)
2 LOAD_METHOD 1 (copy)
4 CALL_METHOD 0
6 STORE_FAST 0 (B)
3 8 LOAD_FAST 0 (B)
10 LOAD_GLOBAL 2 (np)
12 LOAD_ATTR 3 (pi)
14 BINARY_MULTIPLY
16 RETURN_VALUE
Why is there no speed benefit from in-place multiplication when returning a NumPy array, or rather, why does f() get the speed benefit despite using binary multiplication?
I use Python 3.7.12 and numpy 1.21.6.
Your f is benefiting from the temporary elision optimization introduced in NumPy 1.13. When NumPy can tell that one of the operands of an arithmetic operator has no other references, it may reuse that operand's memory for the result array. That is the case in your f function: the array returned by A.copy() has no other references.
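The effect is easy to observe with the timeit module: when the left operand is an unnamed temporary (refcount 1), the multiply runs at roughly in-place speed, while binding the temporary to a name first disables the elision. A rough sketch (a smaller array than the original is used here, and the gap only appears on platforms where NumPy can perform the stack inspection described below):

```python
import timeit

setup = "import numpy as np; A = np.random.random(1_000_000)"

# Left operand is an unnamed temporary -> eligible for elision
t_temporary = timeit.timeit("A.copy() * np.pi", setup=setup, number=50)

# Left operand is bound to a name -> refcount > 1, no elision
t_named = timeit.timeit("B = A.copy(); B * np.pi", setup=setup, number=50)

print(f"temporary operand: {t_temporary:.3f} s, named operand: {t_named:.3f} s")
```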
However, detecting this situation is very expensive and not always possible. Checking that the refcount is 1 is easy, but NumPy has to inspect the C-level call stack to make sure that the operation is being invoked by the bytecode evaluation loop (which will discard the operand references) instead of an extension module (which might hold onto those references).
On platforms without the backtrace function, NumPy cannot perform this optimization. Even on platforms that have backtrace, the cost of the stack inspection means that NumPy only attempts the optimization for arrays of at least 256 KiB.
You can see the implementation in numpy/core/src/multiarray/temp_elide.c.
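For float64 data, that 256 KiB cutoff corresponds to 32 768 elements, so the 10-million-element timings above are all well inside elision territory, while small arrays (such as the size-1 array mentioned earlier) never take the elision path:

```python
# The 256 KiB elision threshold expressed in float64 elements (8 bytes each)
threshold_bytes = 256 * 1024
threshold_elements = threshold_bytes // 8
print(threshold_elements)  # 32768
```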