Tags: python, numpy, list-comprehension, array-broadcasting

Broadcasting + reshaping vs. list comprehension speed in Python


Given the following numpy arrays:

import numpy as np
import time

v1 = np.linspace(20, 250, 100000000)
a = np.array([12.592,16.320])
m = np.array([3, 5])

How is it possible that this list comprehension:

start = time.time()
v2 = np.max(
    [10 ** _a * np.power(v1.astype(float), -_m) for _m, _a in zip(m, a)],
    axis=0
)
end = time.time()
print(end - start)  # prints 5.822041034698486

is more than twice as fast as NumPy's broadcasting?

start = time.time()
v2 = np.max(
    np.power(10, a) * np.power(v1.astype(float)[:, None], -m).reshape(
        v1.shape[0],-1),
    axis=1
)
end = time.time()
print(end - start)  # prints 12.292157173156738

And what is the "fastest approach" to calculating v2?


Solution

  • Both are inefficient. Indeed, you can rewrite the operation much more efficiently using mathematical properties of the logarithm and the exponential: since 10**_a * v1**-_m == np.exp(_a * np.log(10) - _m * np.log(v1)), the whole computation reduces to one logarithm and one exponential. Here is an equivalent (unoptimized) expression that can be further optimized:

    v2 = np.max(
        [np.exp(_a * np.log(10) - _m * np.log(v1.astype(float))) for _m, _a in zip(m, a)],
        axis=0
    )
    

    Since np.log(v1.astype(float)) is a constant, you can pre-compute it. Actually, the items of v1 are already of type float (note that float means np.float64 for NumPy, not the CPython float object type), so np.log(v1) will do the job correctly (unless v1 is an array of np.float32, for example). Additionally, np.exp only needs to be computed on the final result (i.e. after computing np.max), since the exponential is monotonically increasing: a < b is equivalent to e**a < e**b. Finally, you can use in-place operations to avoid creating several expensive temporary arrays. This can be done using the out parameter of many functions, such as np.subtract and np.multiply.

    Here is the resulting code:

    # Pre-compute the logarithm once and reuse rows of tmp as scratch buffers.
    log_v1 = np.log(v1)
    tmp = np.empty((len(a), v1.size), dtype=np.float64)
    # Compute _a * log(10) - _m * log(v1) in-place (one row of tmp per pair),
    # take the element-wise maximum, and only then apply the exponential.
    v2 = np.exp(np.max(
        [np.subtract(_a * np.log(10), np.multiply(log_v1, _m, out=_tmp), out=_tmp)
         for _m, _a, _tmp in zip(m, a, tmp)],
        axis=0,
        out=tmp[0]
    ), out=tmp[0])
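
    Because the out= arguments alias rows of tmp, it is worth checking that this version still matches the straightforward expression. Here is a minimal sanity check you could run (on a smaller array so it is instantaneous; v1_small, ref, buf and opt are just illustrative names):

    v1_small = np.linspace(20, 250, 1000)
    # Reference: the original, straightforward expression.
    ref = np.max([10 ** _a * np.power(v1_small, -_m) for _m, _a in zip(m, a)], axis=0)
    # In-place variant, identical to the code above but applied to v1_small.
    log_s = np.log(v1_small)
    buf = np.empty((len(a), v1_small.size), dtype=np.float64)
    opt = np.exp(np.max(
        [np.subtract(_a * np.log(10), np.multiply(log_s, _m, out=_b), out=_b)
         for _m, _a, _b in zip(m, a, buf)],
        axis=0,
        out=buf[0]
    ), out=buf[0])
    assert np.allclose(ref, opt)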
    

    For even better performance, you can use Numba so as to avoid writing huge temporary arrays to memory (which is slow). Numba can also use multiple threads to speed this up a lot. Here is an example:

    import numba as nb
    
    # Note that fastmath=True can cause issues with special values like NaN and Inf.
    # Please disable it if your input array contains such special values.
    @nb.njit('float64[::1](float64[::1], float64[::1], float64[::1])', fastmath=True, parallel=True)
    def compute(v1, m, a):
        assert a.size > 0 and a.size == m.size
        out = np.empty(v1.size, dtype=np.float64)
        log10 = np.log(10)
        for i in nb.prange(v1.size):
            log_v1 = np.log(v1[i])
            maxi = a[0] * log10 - m[0] * log_v1
            for j in range(1, len(a)):
                value = a[j] * log10 - m[j] * log_v1
                if value > maxi:
                    maxi = value
            out[i] = np.exp(maxi)
        return out
    
    v2 = compute(v1, m.astype(float), a)
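
    Since an explicit signature is given to nb.njit, the function is compiled eagerly when the decorator runs, so the first call is not slowed down by compilation. Also note that fastmath=True allows floating-point operations to be reordered, so results may differ very slightly from the NumPy version. Here is a minimal check you could run against the original expression (v1_small and expected are just illustrative names):

    v1_small = np.linspace(20, 250, 1000)
    expected = np.max(
        [10 ** _a * np.power(v1_small, -_m) for _m, _a in zip(m, a)],
        axis=0
    )
    assert np.allclose(compute(v1_small, m.astype(float), a), expected)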
    

    Benchmark

    Here are results on my 6-core machine:

    Initial code with big arrays:    10.013 s    (inefficient)
    Initial code with lists:          8.147 s    (inefficient)
    Optimized Numpy code:             2.009 s    (memory-bound)
    Optimized Numba code:             0.300 s    (compute-bound)
    

    As you can see, Numba is much faster than the other methods: about 30 times faster than the initial code and almost 7 times faster than the optimized NumPy code. This is because the optimized NumPy version is still memory-bound (it reads and writes several large arrays in RAM), whereas the Numba kernel computes each output item in a single pass over v1 and scales with the number of cores.