numpy · performance · optimization

Numpy `np.prod`, `np.all` are slower than explicit algebraic operations


In the experiments below, explicit algebraic operations are faster than np.all and np.prod.

What could be the reasons for this?

import numpy as np
import time

N_ROWS = int(2e5)
N_ITER = int(1e3)

np.random.seed(123)

mat = np.random.rand(N_ROWS, 2)
mat_bool = mat > .5

# np.prod
start = time.time()
for i in range(N_ITER):
    _ = np.prod(mat, axis=1)
duration = time.time() - start
print(f"np.prod took {duration}s")

start = time.time()
for i in range(N_ITER):
    _ = mat[:, 0] * mat[:, 1]
duration = time.time() - start
print(f"Manual prod took {duration}s")

# np.all
start = time.time()
for i in range(N_ITER):
    _ = np.all(mat_bool, axis=1)
duration = time.time() - start
print(f"np.all took {duration}s")

start = time.time()
for i in range(N_ITER):
    _ = mat_bool[:, 0] * mat_bool[:, 1]
duration = time.time() - start
print(f"manual all took {duration}s")

Results

np.prod took 2.5077707767486572s
Manual prod took 0.0815896987915039s
np.all took 2.831434488296509s
manual all took 0.11392521858215332s

(Numpy version '1.24.2')


Solution

  • The main problem is that the Numpy implementation is not designed to be efficient when an operation iterates over a very small axis.

    Numpy operations that iterate over a very small axis are pretty inefficient. This is due to Numpy's internal overheads, especially its internal iterator. Ufuncs are also a significant source of overhead in the current implementation. All of this is described in this post. Part of the overheads can be reduced, but I do not expect a massive improvement anytime soon; another part cannot be reduced without introducing significant maintainability issues in the Numpy implementation.
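
    To make this fixed per-call cost visible, here is a minimal, illustrative measurement (exact numbers vary by machine): it reduces a 2-element array, where the single multiplication is negligible and almost all of the time is Numpy overhead.

    import numpy as np
    import timeit

    a = np.random.rand(2)
    n = 100_000

    # Mostly measures argument parsing, iterator setup and ufunc
    # dispatch inside Numpy, not the multiplication itself.
    print(timeit.timeit(lambda: np.prod(a), number=n) / n)

    # The same result via plain indexing, for comparison.
    print(timeit.timeit(lambda: a[0] * a[1], number=n) / n)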

    The best thing to do (as pointed out in the above post) is to operate on a transposed array layout. This not only significantly reduces the Numpy overheads (which are still not negligible), but also makes the operation SIMD-friendly, which is critical for performance nowadays.

    Moreover, you can use array methods instead of global Numpy functions to reduce the Numpy call overheads a bit, e.g. mat_bool.all(axis=0) instead of np.all(mat_bool, axis=0). This is only worth doing when the arrays are small (otherwise the call overheads are relatively small).

    Here is the modified code:

    import numpy as np
    import time
    
    N_ROWS = int(2e5)
    N_ITER = int(1e3)
    
    np.random.seed(123)
    
    mat = np.random.rand(N_ROWS, 2)
    
    # Actual (physical) array transposition
    # A copy is needed since mat.T is just a view with the same memory layout
    mat = mat.T.copy()
    
    mat_bool = mat > .5
    
    # np.prod
    start = time.time()
    for i in range(N_ITER):
        _ = mat.prod(axis=0)
    duration = time.time() - start
    print(f"np.prod took {duration}s")
    
    start = time.time()
    for i in range(N_ITER):
        _ = mat[0, :] * mat[1, :]
    duration = time.time() - start
    print(f"Manual prod took {duration}s")
    
    # np.all
    start = time.time()
    for i in range(N_ITER):
        _ = mat_bool.all(axis=0)
    duration = time.time() - start
    print(f"np.all took {duration}s")
    
    start = time.time()
    for i in range(N_ITER):
        _ = mat_bool[0, :] * mat_bool[1, :]
    duration = time.time() - start
    print(f"manual all took {duration}s")
    

    Here are the results of the initial code on my i5-9600KF CPU with Numpy 1.24.3 on Windows:

    np.prod took 2.087920904159546s
    Manual prod took 0.4368312358856201s
    np.all took 3.4697189331054688s
    manual all took 1.1419456005096436s
    

    Here are the results of the provided (transposed) code:

    np.prod took 0.5076425075531006s
    Manual prod took 0.4059138298034668s
    np.all took 0.03490638732910156s
    manual all took 0.014960050582885742s
    

    Here are the speed-ups between the initial code and the provided one:

    np.prod:       4.11
    Manual prod:   1.08
    np.all:       99.40
    manual all:   76.33
    

    The speed-up is massive. There is still a benefit to doing the operations manually, but the gap is significantly smaller (25% and 133% respectively for the provided code, as opposed to 378% and 204% for the initial code).

    Even once transposed, part of the overhead is still due to ufuncs, as pointed out by hpaulj, but the overheads are now almost acceptable. Here are the timings:

    In [94]: %timeit np.multiply(mat[0,:],mat[1::])
    400 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    In [95]: %timeit np.multiply.reduce(mat, axis=0)
    510 µs ± 5.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    Regarding the speed of the boolean operations: they are now so fast that the time spent in Numpy overheads starts to be significant. At this scale (microseconds), I do not think it is worth spending time understanding why one function is faster or slower than another, since both will likely be slow anyway; in this specific case, a faster implementation can often be designed thanks to vectorization and compiler tools like Numba/Cython.
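
    For instance, here is a minimal Numba sketch (assuming Numba is installed; row_pair_all is an illustrative name, not code from this answer) that replaces the boolean reduction with a compiled loop, avoiding the Numpy call and iterator overheads entirely:

    import numba as nb
    import numpy as np

    @nb.njit(cache=True)
    def row_pair_all(mat_bool):
        # Assumes the transposed (2, N) layout used above.
        n = mat_bool.shape[1]
        out = np.empty(n, dtype=np.bool_)
        for i in range(n):
            out[i] = mat_bool[0, i] and mat_bool[1, i]
        return out

    # The first call includes JIT compilation time.
    mat_bool = np.random.rand(2, 10) > .5
    assert np.array_equal(row_pair_all(mat_bool), mat_bool.all(axis=0))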

    Also please note that np.logical_and is better suited than a product for booleans (it is equally fast on my machine, though it can be theoretically faster).
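
    For example, a small equivalence check (using the transposed layout from the code above):

    import numpy as np

    mat_bool = np.random.rand(2, 10) > .5
    via_prod = mat_bool[0, :] * mat_bool[1, :]  # product acts as a logical AND here
    via_and = np.logical_and(mat_bool[0, :], mat_bool[1, :])
    assert np.array_equal(via_prod, via_and)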