python, numpy, optimization, numba, cpu-cache

Optimization Challenge Due to L1 Cache with Numba


I've been working on optimizing the calculation of differences between elements of NumPy arrays. I have been using Numba for performance improvements, but I see a 100-microsecond jump in execution time when the array size surpasses 1 MB. I assume this is due to my Ryzen 7950X CPU's 1 MB L1 cache size.

Here is some example code:

import numpy as np
from numba import jit

@jit(nopython=True)
def extract_difference_1(random_array):
    shape0, shape1 = random_array.shape
    difference_arr = np.empty((shape0, shape1), dtype=np.float64)
    for i in range(shape0):
        difference_arr[i] = (
            random_array[i, 0] - random_array[i, 1],
            random_array[i, 1] - random_array[i, 2],
            random_array[i, 2] - random_array[i, 3],
            random_array[i, 3] - random_array[i, 4],
            random_array[i, 4] - random_array[i, 5],
            random_array[i, 5] - random_array[i, 6],
            random_array[i, 6] - random_array[i, 0],
        )

    return difference_arr

@jit(nopython=True)
def extract_difference_2(random_array):
    shape0, shape1 = random_array.shape
    split_index = shape0 // 2
    part_1 = extract_difference_1(random_array[:split_index])
    part_2 = extract_difference_1(random_array[split_index:])

    return part_1 , part_2

x_list = [18500, 18700, 18900]
y = 7
for x in x_list:
    random_array = np.random.rand(x, y)
    print(f"\nFor (x,y) = ({x}, {y}), random_array size is {array_size_string(random_array)}:\n")
    for func in [extract_difference_1, extract_difference_2]:
        func(random_array) # compile the function
        timing_result = %timeit -q -o func(random_array)
        print(f"{func.__name__}:\t {timing_result_message(timing_result)}")

The timing results are:

For (x,y) = (18500, 7), random_array size is 0.988 MB, 1011.72 KB:

extract_difference_1:    32.4 µs ± 832 ns,   b: 31.5 µs,    w: 34.3 µs,     (l: 7, r: 10000),
extract_difference_2:    33.8 µs ± 279 ns,   b: 33.5 µs,    w: 34.3 µs,     (l: 7, r: 10000),

For (x,y) = (18700, 7), random_array size is 0.999 MB, 1022.66 KB:

extract_difference_1:    184 µs ± 2.15 µs,   b: 181 µs,     w: 188 µs,  (l: 7, r: 10000),
extract_difference_2:    34.4 µs ± 51.2 ns,  b: 34.3 µs,    w: 34.5 µs,     (l: 7, r: 10000),

For (x,y) = (18900, 7), random_array size is 1.009 MB, 1033.59 KB:

extract_difference_1:    201 µs ± 3.3 µs,    b: 196 µs,     w: 205 µs,  (l: 7, r: 10000),
extract_difference_2:    34.5 µs ± 75.2 ns,  b: 34.4 µs,    w: 34.6 µs,     (l: 7, r: 10000),

Splitting the resulting difference_arr into two fixes the slowdown, but I would prefer the result to be a single array, especially as I will later be increasing y to 10, 50, 100, 1000 and x to 20000. When I recombine the split arrays part_1 and part_2 into difference_arr, it is slower than extract_difference_1. I think the slowdown occurs because the array produced by extract_difference_1 is larger than 1 MB, so the L1 cache is not used.

Is there a way to maintain the performance while having the result be a single array with Python, Numba or any other package? Or is there a way that will allow me to recombine these arrays without a performance penalty for the resulting array exceeding the L1 cache size?


Solution

  • TL;DR: The performance issue is not caused by your CPU cache. It comes from the behaviour of the memory allocator on your target platform, which is almost certainly Windows.


    Analysis

    I assume this is due to my CPU's Ryzen 7950X 1 MB L1 cache size.

    First of all, the AMD Ryzen 7950X is a Zen4 CPU. This architecture has L1D caches of 32 KiB, not 1 MiB. That being said, the L2 cache is 1 MiB on this architecture.

    While the cache-size hypothesis is tempting at first glance, there are two major issues with it:

    First, the same amount of data is read and written by the two functions. The fact that the array is split in two parts does not change this. Thus, if cache misses happened in the first function due to the L2 capacity, they should also happen in the other function. Regarding memory accesses, the only major difference between the two functions is the order of the accesses, which should not have a significant performance impact anyway (the array is large enough that latency issues are mitigated).

    Moreover, on Zen4 the L3 cache is not that much slower than the L2 cache. Indeed, it should be no more than about twice as slow, while the experimental results show a more than 5x bigger execution time.

    I can reproduce this on a Cascade Lake CPU (which also has a 1 MiB L2 cache) running Windows. Here is the result:

    For (x,y) = (18500, 7), random_array size is 0.988006591796875:
    
    extract_difference_1:    68.6 µs ± 3.63 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    extract_difference_2:    70.8 µs ± 5.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    For (x,y) = (18700, 7), random_array size is 0.998687744140625:
    
    extract_difference_1:    342 µs ± 8.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    extract_difference_2:    69.7 µs ± 2.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    For (x,y) = (18900, 7), random_array size is 1.009368896484375:
    
    extract_difference_1:    386 µs ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    extract_difference_2:    67 µs ± 4.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    New hypothesis: allocation overheads

    Splitting the resulting difference_arr into two does it

    The main difference between the two functions is that one performs two small allocations rather than one big one. This raises a new hypothesis: can allocation timings explain the issue?

    We can easily answer this question based on this previous post: Why is allocation using np.empty not O(1). It shows a big performance gap between an allocation of 0.76 MiB (np.empty(10**5)) and the next bigger one, which is >1 MiB. Here are the results provided in that answer:

    np.empty(10**5)   # 620 ns ± 2.83 ns per loop    (on 7 runs, 1000000 loops each)
    np.empty(10**6)   # 9.61 µs ± 34.2 ns per loop   (on 7 runs, 100000 loops each)
    

    More precisely, here are new benchmarks from my current machine:

    %timeit -n 10_000 np.empty(1000*1024, np.uint8)
    793 ns ± 18.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    %timeit -n 10_000 np.empty(1024*1024, np.uint8)
    6.6 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    We can see that the gap is close to 1 MiB. Note that the timings between 1000 KiB and 1024 KiB are not very stable (showing that the result depends on hidden low-level parameters -- possibly packing/alignment issues).

    This NumPy allocation behaviour is, AFAIK, specific to Windows and, AFAIR, not visible on Linux (gaps might be seen, but not as big and not at the same threshold).

    An explanation is provided in the linked answer: expensive kernel calls are performed beyond a certain threshold (huge pages might also play a role).
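If you want to check where the threshold sits on your own machine, here is a minimal, self-contained sketch using the standard timeit module (the chosen sizes and the ~1 MiB threshold are assumptions based on the Windows results above; on Linux the gap may be small or absent):

```python
import timeit

import numpy as np

# Allocation sizes (in bytes) straddling the suspected ~1 MiB threshold
sizes = [512 * 1024, 1000 * 1024, 1024 * 1024, 2 * 1024 * 1024]

def time_alloc(n_bytes, repeats=10_000):
    # Time np.empty alone: allocation (not initialization) dominates here
    return timeit.timeit(lambda: np.empty(n_bytes, np.uint8), number=repeats) / repeats

results = {n: time_alloc(n) for n in sizes}
for n_bytes, seconds in results.items():
    print(f"np.empty({n_bytes:>8} bytes): {seconds * 1e6:.2f} µs per allocation")
```

On a platform exhibiting the behaviour described above, the per-allocation time should jump by several microseconds once the size crosses the threshold.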


    Solutions

    Is there a way to maintain the performance while having the result be a single array with Python

    Yes. You can preallocate the output array so as not to pay the expensive allocation overhead. An alternative solution is to use another allocator (e.g. jemalloc or tcmalloc).

    Here is modified code that preallocates memory:

    import numpy as np
    import numba as nb

    @nb.jit(nopython=True)
    def extract_difference_1(random_array, scratchMem):
        shape0, shape1 = random_array.shape
        difference_arr = scratchMem[:shape0*shape1].reshape((shape0, shape1))  # instead of np.empty((shape0, shape1), dtype=np.float64)
        for i in range(shape0):
            difference_arr[i] = (
                random_array[i, 0] - random_array[i, 1],
                random_array[i, 1] - random_array[i, 2],
                random_array[i, 2] - random_array[i, 3],
                random_array[i, 3] - random_array[i, 4],
                random_array[i, 4] - random_array[i, 5],
                random_array[i, 5] - random_array[i, 6],
                random_array[i, 6] - random_array[i, 0],
            )
    
        return difference_arr
    
    @nb.jit(nopython=True)
    def extract_difference_2(random_array, scratchMem):
        shape0, shape1 = random_array.shape
        split_index = shape0 // 2
        # These two small allocations stay below the costly threshold, so they are cheap
        part_1 = extract_difference_1(random_array[:split_index], np.empty((split_index, shape1)))
        part_2 = extract_difference_1(random_array[split_index:], np.empty((shape0 - split_index, shape1)))
    
        return part_1 , part_2
    
    x_list = [18500, 18700, 18900]
    y = 7
    scratchMem = np.empty(16*1024*1024)
    for x in x_list:
        random_array = np.random.rand(x, y)
        print(f"\nFor (x,y) = ({x}, {y}), random_array size is {x*y*8/1024/1024}:\n")
        for func in [extract_difference_1, extract_difference_2]:
            func(random_array, scratchMem) # compile the function
            timing_result = %timeit -q -o func(random_array, scratchMem)
            print(f"{func.__name__}:\t {timing_result}")
    

    Here is the result:

    For (x,y) = (18500, 7), random_array size is 0.988006591796875:
    
    extract_difference_1:    65.1 µs ± 2.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    extract_difference_2:    71 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    For (x,y) = (18700, 7), random_array size is 0.998687744140625:
    
    extract_difference_1:    69.3 µs ± 4.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    extract_difference_2:    68.3 µs ± 3.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    For (x,y) = (18900, 7), random_array size is 1.009368896484375:
    
    extract_difference_1:    68.5 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    extract_difference_2:    68.7 µs ± 3.14 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    We can see that the problem is now gone! This confirms the hypothesis that allocations were the main source of the performance issue.
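As a side note, the same preallocation idea also works in pure NumPy without Numba: writing into column slices of a preallocated output through np.subtract(..., out=...) avoids both the big per-call allocation and large temporaries. This is only a minimal sketch of that idea (not the benchmarked code above), assuming the same 7-column wrap-around difference pattern as the question:

```python
import numpy as np

def extract_difference_prealloc(random_array, out):
    # Columns 0..y-2: a[:, j] - a[:, j+1]; last column wraps around: a[:, -1] - a[:, 0]
    np.subtract(random_array[:, :-1], random_array[:, 1:], out=out[:, :-1])
    np.subtract(random_array[:, -1], random_array[:, 0], out=out[:, -1])
    return out

rng = np.random.default_rng(0)
a = rng.random((18500, 7))
out = np.empty_like(a)  # allocated once, reused across calls
result = extract_difference_prealloc(a, out)
```

Because out is allocated once and reused, repeated calls never hit the expensive allocation path, regardless of how large x and y grow.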