Tags: numpy, performance

`numpy.astype` is slower than for loops for large arrays


I'm trying to use numpy to vectorize my code more and see performance benefits, but I was surprised to find that my code runs slower when using numpy.

I'm doing a very simple test to compare "native python" and "vectorization":

Setup:

from timeit import timeit

import numpy as np

TRIALS = 100

TIMESTAMPS = 10
CAMERAS = 5
HEIGHT = 240
WIDTH = 320
NUM_CUSTOMERS = 10

np.random.seed(0)

FRAMES = np.random.randint(0, NUM_CUSTOMERS, (TIMESTAMPS, CAMERAS, HEIGHT, WIDTH), dtype=np.uint8)

Native Python:

print("Native Python:")

def native_python():
    for t in range(TIMESTAMPS):
        for c in range(CAMERAS):
            (FRAMES[t, c] == 3).astype(np.uint8)
            (FRAMES[t, c] == 4).astype(np.uint8)
            (FRAMES[t, c] == 5).astype(np.uint8)

print(timeit(lambda: native_python(), number=1_000))

Vectorized Python:

print("Vectorized:")
def vectorized():
    (FRAMES == 3).astype(np.uint8)
    (FRAMES == 4).astype(np.uint8)
    (FRAMES == 5).astype(np.uint8)

print(timeit(lambda: vectorized(), number=1_000))

Here is information about my system:

$ uname
Linux
$ nproc
16

When I ran this on my computer, I was surprised to see that the vectorized version was about 8 times slower:

Native Python:
0.9657893369439989
Vectorized:
8.072781160008162

Even if numpy just reverts to plain for loops internally, I expected the vectorized version to be at least as fast as the native for loops, not slower, despite operating on such a giant array.

Appreciate any help!


Solution

  • You are actually comparing two vectorized implementations, and the latter is more expensive due to memory-hierarchy overheads (as pointed out by @hpaulj in the comments). The overhead of the CPython loops is small in both cases.

    More specifically, the former code operates on arrays of 240*320 = 76,800 one-byte elements, which fit in the L2 cache of all mainstream CPUs, while the latter operates on arrays of 3,840,000 elements, which do not fit in the L2 cache, but in the L3 cache on some CPUs or only in RAM on others. The L3 cache is slower than the L2 cache, and RAM is much slower than the L3 cache: CPU caches closer to the cores are much faster but also much smaller. Because of that, doing the computation chunk by chunk, as the first code does, is good practice to improve temporal memory locality and thus performance.
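
    To make those sizes concrete, here is the arithmetic (both np.bool_ and np.uint8 use one byte per element, so element counts equal byte counts):

        per_frame = 240 * 320            # 76_800 bytes (~75 KiB): L2-sized
        full_array = 10 * 5 * 240 * 320  # 3_840_000 bytes (~3.7 MiB): L3- or RAM-sized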

    One reason temporal memory locality is improved here is that the internal allocator tends to recycle arrays (that is, it hands back the address of a recently deleted array of similar size). This often happens with the temporary arrays NumPy creates. Another reason is that temporary arrays are written and then often read back immediately, and reading an array back is faster if it is still in a fast CPU cache. For example, FRAMES == 3 creates and fills a boolean array which is immediately read back by the astype call.
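
    As an aside, if you want to avoid that write-then-read round trip through a full-size temporary entirely, ufuncs accept a preallocated output buffer. A minimal sketch, using the question's setup and assuming the default 'same_kind' ufunc casting (bool to uint8 is a safe cast, so it is accepted):

        # Preallocate the destination once; np.equal writes the 0/1 results
        # straight into it, so no full-size boolean temporary is handed back.
        out = np.empty(FRAMES.shape, dtype=np.uint8)
        np.equal(FRAMES, 3, out=out)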

    On my machine, with an i5-9600KF CPU, the L3 cache is large enough (9 MiB) to (mostly) hold the arrays, so the difference in performance is not as dramatic. I expect this not to be the case on your machine.

    However, note that .astype(np.uint8) is rather expensive here and not even needed in this case. Indeed, you can use .view(np.uint8) instead. The former creates a new array and converts each item, while the latter just reinterprets the bytes in memory without any copy. The latter is only safe if you know exactly what you are doing: here it is safe because the input (np.bool_) and output (np.uint8) types have the same size (1 byte), both are unsigned, and they are of compatible kinds (boolean versus integer). Once this modification is done, the code is much faster.
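
    Concretely, the modified benchmark looks like this (same setup and timing harness as in the question; only astype is swapped for view):

        def native_python():
            for t in range(TIMESTAMPS):
                for c in range(CAMERAS):
                    (FRAMES[t, c] == 3).view(np.uint8)  # reinterpret the bool buffer, no copy
                    (FRAMES[t, c] == 4).view(np.uint8)
                    (FRAMES[t, c] == 5).view(np.uint8)

        def vectorized():
            (FRAMES == 3).view(np.uint8)
            (FRAMES == 4).view(np.uint8)
            (FRAMES == 5).view(np.uint8)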

    Here are performance results on my machine (CPython 3.8.1, NumPy 1.24.4, on Windows):

    Initial implementation:
        - Native Python:  5.990550100000291
        - Vectorized:     9.843130900000233
    
    Optimized implementation (using .view(np.uint8)):
        - Native Python:  0.9843445999999858
        - Vectorized:     2.8540657999997165
    

    In the optimized version of the "Native Python" implementation, the big input array completely fits in the L3 cache and the small temporary arrays do not affect it much, so data is not evicted. In the optimized "Vectorized" implementation, data is evicted from the L3 cache more often because the temporary arrays are relatively big: the minimum space required is 3,840,000*2 = 7,680,000 bytes, and that assumes the allocator and the cache behave optimally, which they do not. My L3 cache cannot hold three times the array size, that is, 3,840,000*3 = 11,520,000 bytes, hence the significantly higher execution time. Not to mention that the "Native Python" version mostly operates in the L1 cache for its temporary arrays.
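
    Putting both ideas together, here is a sketch of a chunked variant that keeps temporaries cache-sized and skips the astype copy. It reuses the question's setup; process is a hypothetical consumer of the masks, not part of the original benchmark:

        def chunked(process):
            for t in range(TIMESTAMPS):
                frame = FRAMES[t]  # 5*240*320 = 384_000 bytes: comfortably cache-sized
                # The bool temporaries are reinterpreted as uint8 in place (no copy),
                # and stay small enough to live in fast cache between write and read.
                process((frame == 3).view(np.uint8),
                        (frame == 4).view(np.uint8),
                        (frame == 5).view(np.uint8))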