Tags: python, arrays, numpy, numba

Any faster way to concatenate arrays in Python?


I have a function that does the following:

  • Gets an array and then does arbitrary operations with slices of the array, generating new arrays
  • Concatenates all the new arrays into a single array and returns it.

A very straightforward way I started with was the following code, using np.concatenate:

import numpy as np
from numba import njit

arr = np.ones(10000)

@njit(fastmath=True)
def EQ1(a):
    # Each "equation" builds a new array from a slice of the input.
    eq1 = 2 * a[1:-1]
    eq2 = 4 * a[0:2]

    # Gather the pieces and join them into a single result array.
    sys = (eq1, eq2)
    sys = np.concatenate(sys)

    return sys

Then I decided to look for faster answers and, oddly, a Numba-jitted function that writes each individual element into a pre-allocated array is significantly faster:

@njit(fastmath=True)
def EQ2(a):
    # Pre-allocate the output and fill it element by element.
    sys = np.empty_like(a)
    l = 0
    for i in range(1, len(a) - 1):
        eq1 = 2 * a[i]
        sys[l] = eq1
        l += 1

    for i in range(2):
        eq2 = 4 * a[i]
        sys[l] = eq2
        l += 1

    return sys

%timeit EQ1(arr)
6.31 µs ± 9.72 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%timeit EQ2(arr)
3.48 µs ± 2.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

But the second approach starts occupying a lot of space as the number of equations grows, so I'm wondering whether there are any faster alternatives. I already experimented with append() and extend(), but the jitted for loop is still faster. Thanks.


Solution

  • You're not exactly asking about concatenation speed. Rather, to accomplish this task, you are asking "how many memory operations do I need?", with the goal of being lazy, of doing as little as possible.

    We're doing essentially no computation here; it is entirely an exercise focused on moving chunks of memory around.

    EQ1

    These operations have negligible cost and won't be considered further:

        eq2 = 4 * a[0:2]
        sys = (eq1, eq2)
    

    We will take "writing a bunch of ones", arr = np.ones(10_000), as "one memory operation", an array's worth of writes across the bus to DRAM. It is my belief that it happens in a sensible way, with each cache line being written just once. So that is our timing unit.

        eq1 = 2 * a[1:-1]
    

    Here we read one unit, which thanks to caching appears to have zero cost in a run-many-times benchmarking context. We trivially double each item and write the result to a temporary array, at a cost of one unit.
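
    If it helps to see why reading the slice is essentially free while the temporary costs a full unit of writes, here is a quick check in plain NumPy (outside Numba; the variable names are just illustrative):

        import numpy as np

        a = np.ones(10_000)
        view = a[1:-1]             # basic slicing makes a view: no data is copied
        eq1 = 2 * view             # the multiply allocates and writes a brand-new array

        print(view.base is a)      # True  -> the slice shares a's memory
        print(eq1.base is None)    # True  -> eq1 owns freshly written memory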

        sys = np.concatenate(sys)
    

    Now concatenate reads one (cached) unit at zero cost, and incurs the expense of an additional unit of writing to newly allocated storage.

    Total: two units.
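
    One way to sanity-check this accounting is to time the two stages separately in plain NumPy; the helper names eq1_only and full below are just illustrative, and your absolute numbers will of course differ:

        import numpy as np

        a = np.ones(10_000)

        def eq1_only(a):
            # one unit: write the doubled slice to a fresh temporary
            return 2 * a[1:-1]

        def full(a):
            eq1 = 2 * a[1:-1]                    # one unit
            eq2 = 4 * a[0:2]                     # negligible
            return np.concatenate((eq1, eq2))    # one more unit: write the result

        # %timeit eq1_only(a)
        # %timeit full(a)
        # If the accounting holds, eq1_only should come in at roughly half of full.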

    EQ2

    The range(2) loop has negligible cost and won't be considered further. Similarly the .empty_like().

        for i in range(1, len(a) - 1):
            eq1 = 2 * a[i]
            sys[l] = eq1
            l += 1
    

    Reading a from cache we get essentially for free. Storing double that value into sys costs us one unit. Multiplying by two is not a bottleneck in this pipeline relative to the memory operations. Buffering each result in a register for a moment, rather than in DRAM for a while, is a big win.


    So the temp vars of EQ1 cost two units of memory operations, while EQ2 costs us just one unit; the measured timings (6.31 µs vs 3.48 µs, a ratio of roughly 1.8) track that 2 : 1 estimate.

    Moral of the story: "big" temp vars are costly, as they aren't cache friendly.
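
    A middle ground, if you want EQ1's concise slice syntax with only EQ2's single unit of writes, is to pre-allocate the output and let each ufunc write directly into a slice of it through its out= parameter, so no intermediate arrays are built and then copied. This is a plain-NumPy sketch, not run through @njit, and the name EQ3 is just illustrative:

        import numpy as np

        def EQ3(a):
            # Pre-allocate the full result once, as EQ2 does.
            sys = np.empty_like(a)
            # Write each scaled slice straight into its destination slot.
            np.multiply(a[1:-1], 2.0, out=sys[:-2])
            np.multiply(a[0:2], 4.0, out=sys[-2:])
            return sys

    Whether this actually beats the jitted per-element loop is worth confirming with %timeit on your machine; the point is only that it keeps the memory traffic at roughly one unit of writes while staying close to the original slice-based style.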