python, numpy, cython

Fastest way to concatenate slices of numpy array


I have a large number of small numpy arrays (groups) of different sizes, and I want to concatenate an arbitrary subset of these groups as fast as possible. The solution I originally came up with is to store these groups as an np.array of np.arrays and then access a subset of groups with fancy indexing:

groups = []
for i in range(100000):
    size = np.random.randint(3) + 1
    groups.append(np.random.randint(1000000, size=size))
groups = np.array(groups)  # dtype=np.object
indices = np.random.randint(len(groups), size=1000)

%%timeit
np.concatenate(groups[indices])
>>> 204 µs ± 395 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

However, this solution is inefficient in terms of memory consumption: the groups are small (2 elements on average), and I have to store a full numpy array structure for each group, which costs almost 100 bytes per group (too much for me).
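
A rough check of that per-group overhead (illustrative only; exact numbers depend on the NumPy version and platform):

import sys
import numpy as np

a = np.random.randint(1000000, size=2)
print(a.nbytes)          # the raw data: only a few bytes for 2 elements
print(sys.getsizeof(a))  # data plus roughly 100 bytes of ndarray object overhead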

To make the solution more memory-efficient, I've decided to concatenate all the groups and store the array boundaries in a separate offsets array:

data = np.concatenate(groups)
offsets = np.cumsum([0] + [len(group) for group in groups])
# ith group is data[offsets[i]: offsets[i + 1]]

But then concatenating a subset is not straightforward at all. Something like this:

%%timeit
np.concatenate([data[offsets[i]: offsets[i + 1]] for i in indices])
>>> 1.02 ms ± 44.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

runs about 5 times slower than the original solution. I think this is due to two things: first, iterating over a numpy array of indices (Python wraps each C int into an object), and second, Python creates a numpy array structure for every slice/index. I don't think the concatenation time can be reduced any further for this case in pure Python, so I've decided to write a Cython solution.
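
A rough check of that reasoning (illustrative only): timing just the slice-view construction, without the final concatenate, should show where most of the ~1 ms goes.

%%timeit
[data[offsets[i]: offsets[i + 1]] for i in indices]  # build 1000 small ndarray views, no copying yet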

%%cython
import numpy as np
ctypedef long long int64

def concatenate(data, offsets, indices):
    cdef int64[::] data_view = data
    cdef int64[::] indices_view = indices
    cdef int64[::] offsets_view = offsets
    
    size = np.sum(offsets[indices + 1]) - np.sum(offsets[indices])
    res = np.zeros(size, dtype=np.int64)
    cdef int64[::] res_view = res
    
    cdef int64 i, l = 0, r
    for i in indices_view:
        r = l + offsets_view[i + 1] - offsets_view[i]
        res_view[l: r] = data_view[offsets_view[i]: offsets_view[i + 1]]
        l = r
    return res

%%timeit
concatenate(data, offsets, indices)
>>> 277 µs ± 89.8 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This solution is faster than the previous one, but still a little slower than the original. The biggest problem, though, is that I don't know the data type in advance. I've used int64 in the example, but it could be any numeric type, e.g. float32. Because of that, I cannot use typed memory views as I did above. In theory, I only need to know the size of the type (4/8 bytes), and if I have pointers to the data and result arrays, I can use memcpy or something similar to copy the slices. But I don't know how to do that in Cython. Is there a way to do it?


Solution

  • Here's my pure numpy-only solution, the adv_concatenate() function. It gives a 15x-47x speedup (varies between machines) compared to regular np.concatenate().

    NOTE: There is also a second, faster solution after the code of the first one.

    For time measurement I used the pip module timerit; install it once with python -m pip install timerit. Two types of machines were used for the timings: the first is Windows-based (my home laptop, the same machine for all tests); the second is Linux-based (a different repl.it machine for every test, so speeds differ between tests but are consistent within a single run/test).

    The idea of the algorithm is to use numpy's cumulative sum function (.cumsum()):

    1. We create an array of 1s whose length equals the total length of the resulting concatenated data array. This array will hold the index of every data element to be fetched into the result.
    2. At the starting position of each block (small sub-array) inside this array, we change the value in such a way that, after running cumsum() over the whole array, these starting values are transformed into starting offsets into the data array. The remaining values stay 1.
    3. We apply .cumsum() to this array. Now every element holds the correct index of a data element to be fetched.
    4. We form the resulting fetched data array by simply indexing data with the index array built above (a tiny worked example follows this list).
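
    A tiny worked example of steps 1-4, with made-up toy data (illustrative only, not part of the benchmark):

    import numpy as np

    data    = np.array([10, 11, 12, 20, 21, 30])  # three groups: [10,11,12], [20,21], [30]
    offsets = np.array([0, 3, 5, 6])
    indices = np.array([2, 0])                    # fetch group 2, then group 0

    begs, ends = offsets[indices], offsets[indices + 1]  # [5 0], [6 3]
    lens  = ends - begs                                  # [1 3]
    clens = lens.cumsum()                                # [1 4]
    ix = np.ones(clens[-1], dtype = offsets.dtype)       # [1 1 1 1]      (step 1)
    ix[0] = begs[0]                                      # [5 1 1 1]      (step 2)
    ix[clens[:-1]] = begs[1:] - ends[:-1] + 1            # [5 -5 1 1]     (step 2)
    ix = ix.cumsum()                                     # [5 0 1 2]      (step 3)
    print(data[ix])                                      # [30 10 11 12]  (step 4)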

    This algorithm could potentially be boosted even more by pre-computing some values, like offsets[1:] - offsets[:-1] or offsets[:-1] + 1, and using them inside the adv_concatenate() function.

    Try it online!

    # Needs: python -m pip install numpy timerit
    from timerit import Timerit
    import numpy as np
    np.random.seed(0)
    Timerit._default_asciimode = True
    
    groups = []
    for i in range(100000):
        size = np.random.randint(3) + 1
        groups.append(np.random.randint(1000000, size = size))
    groups = np.array(groups)  # dtype=np.object
    indices = np.random.randint(len(groups), size = 1000)
    
    data = np.concatenate(groups)
    offsets = np.cumsum([0] + [len(group) for group in groups])
    
    timer = lambda: Timerit(num = 600, verbose = 1)
    
    print('np.concatenate(): ', end = '', flush = True)
    tim = timer()
    for t in tim:
        with t:
            ref = np.concatenate([data[offsets[i] : offsets[i + 1]] for i in indices])
    tref = tim.mean()
    
    def adv_concatenate(data, offsets, indices):
        begs, ends = offsets[indices], offsets[indices + 1]
        lens = ends - begs
        clens = lens.cumsum()
        ix = np.ones((clens[-1],), dtype = offsets.dtype)
        ix[0] = begs[0]
        ix[clens[:-1]] = begs[1:] - ends[:-1] + 1
        ix = ix.cumsum()
        return data[ix]
        
    print('adv_concatenate(): ', end = '', flush = True)
    tim = timer()
    for t in tim:
        with t:
            adv = adv_concatenate(data, offsets, indices)
    tadv = tim.mean()
    assert np.array_equal(ref, adv) # Check that our solution is correct
    
    print('speedup:', round(tref / tadv, 3))
    

    Outputs:

    On first machine (Windows):

    np.concatenate(): Timed best=3.129 ms, mean=3.225 +- 0.1 ms
    adv_concatenate(): Timed best=191.137 us, mean=208.012 +- 20.7 us
    speedup: 15.504
    

    On second machine (Linux):

    np.concatenate(): Timed best=1.666 ms, mean=2.314 +- 0.4 ms
    adv_concatenate(): Timed best=35.596 us, mean=48.680 +- 15.4 us
    speedup: 47.532
    

    The second solution is even faster than the first one, giving a 40x-150x speedup (varies between machines) compared to regular np.concatenate(). But the second solution uses the Numba JIT LLVM-based compiler, which needs to be installed via python -m pip install numba.

    Although it uses the extra numba package, the central function adv_concatenate_indexes_numba() is still very simple, with about the same number of lines of code as in the first solution. The algorithm is also much simpler: just two plain loops.

    The current solution works for any data type, because the central function only computes the resulting indexes and hence never touches data's dtype at all. The solution can also be boosted by another 10%-90% if, instead of computing indexes, the numba function computes the resulting data array directly, but that is only possible for the fairly simple data types supported by numba, including all number types. Here is the code for this improved solution, which achieves up to a 250x speedup! Timings for this improved version on the second machine (Linux):

    np.concatenate(): Timed best=1.640 ms, mean=3.403 +- 1.9 ms
    adv_concatenate_numba(): Timed best=12.669 us, mean=17.235 +- 6.9 us
    speedup: 197.46
    
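
    A minimal sketch of the idea behind that improved version (not the exact linked code; it assumes int64 data like the rest of the examples, and the name adv_concatenate_data_numba() is just illustrative):

    import numpy as np, numba

    @numba.njit('i8[:](i8[:], i8[:], i8[:])', cache = True)
    def adv_concatenate_data_numba(data, offsets, indices):
        # First pass: compute the total length of the result
        tlen = 0
        for i in range(indices.size):
            ix = indices[i]
            tlen += offsets[ix + 1] - offsets[ix]

        # Second pass: write the result elements directly, no index array needed
        pos, res = 0, np.empty((tlen,), dtype = np.int64)
        for i in range(indices.size):
            ix = indices[i]
            for j in range(offsets[ix], offsets[ix + 1]):
                res[pos] = data[j]  # copy the element itself, not its index
                pos += 1

        return res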

    The code of the more generic (indexes-only) solution comes next:

    Try it online!

    # Needs: python -m pip install numpy numba timerit
    from timerit import Timerit
    import numpy as np, numba
    np.random.seed(0)
    Timerit._default_asciimode = True
    
    groups = []
    for i in range(100000):
        size = np.random.randint(3) + 1
        groups.append(np.random.randint(1000000, size = size, dtype = np.int64))
    groups = np.array(groups)  # dtype=np.object
    indices = np.random.randint(len(groups), size = 1000, dtype = np.int64)
    
    data = np.concatenate(groups)
    offsets = np.cumsum([0] + [len(group) for group in groups], dtype = np.int64)
    
    timer = lambda: Timerit(num = 600, verbose = 1)
    
    print('np.concatenate(): ', end = '', flush = True)
    tim = timer()
    for t in tim:
        with t:
            ref = np.concatenate([data[offsets[i] : offsets[i + 1]] for i in indices])
    tref = tim.mean()
    
    @numba.njit('i8[:](i8[:], i8[:])', cache = True)
    def adv_concatenate_indexes_numba(offsets, indices):
        tlen = 0
        for i in range(indices.size):
            ix = indices[i]
            tlen += offsets[ix + 1] - offsets[ix]
            
        pos, r = 0, np.empty((tlen,), dtype = offsets.dtype)
        for i in range(indices.size):
            ix = indices[i]
            for j in range(offsets[ix], offsets[ix + 1]):
                r[pos] = j
                pos += 1
                
        return r
        
    def adv_concatenate2(data, offsets, indices):
        return data[adv_concatenate_indexes_numba(offsets, indices)]
        
    adv_concatenate2(data, offsets, indices) # Once pre-compile Numba
    print('adv_concatenate2(): ', end = '', flush = True)
    tim = timer()
    for t in tim:
        with t:
            adv = adv_concatenate2(data, offsets, indices)
    tadv = tim.mean()
    assert np.array_equal(ref, adv) # Check that our solution is correct
    
    print('speedup:', round(tref / tadv, 3))
    

    Output:

    On first machine (Windows):

    np.concatenate(): Timed best=3.201 ms, mean=3.356 +- 0.1 ms
    adv_concatenate2(): Timed best=79.681 us, mean=82.991 +- 6.7 us
    speedup: 40.442
    

    On second machine (Linux):

    np.concatenate(): Timed best=1.541 ms, mean=2.220 +- 0.7 ms
    adv_concatenate2(): Timed best=12.012 us, mean=14.830 +- 4.8 us
    speedup: 149.716
    

    Inspired by the Cython code in @pavelgramovich's answer, I've also decided to implement a simplified version with a plain loop (function concatenate1()) in addition to the memcpy() version (function concatenate0()); the simplified version turned out to be 1.5-2x faster than the memcpy version on the current test data. The full code comparing both versions is below:

    Try it online!

    # Needs: python -m pip install numpy timerit cython setuptools
    from timerit import Timerit
    import numpy as np
    np.random.seed(0)
    Timerit._default_asciimode = True
    
    groups = []
    for i in range(100000):
        size = np.random.randint(3) + 1
        groups.append(np.random.randint(1000000, size = size, dtype = np.int64))
    groups = np.array(groups)  # dtype=np.object
    indices = np.random.randint(len(groups), size = 1000, dtype = np.int64)
    
    data = np.concatenate(groups)
    offsets = np.cumsum([0] + [len(group) for group in groups], dtype = np.int64)
    
    timer = lambda: Timerit(num = 600, verbose = 1)
    
    def compile_cy_cats():
        src = """
    import numpy as np
    cimport numpy as np
    cimport cython
    from libc.string cimport memcpy 
    
    @cython.boundscheck(False)
    @cython.wraparound(False)
    def concatenate0(np.ndarray data, np.ndarray offsets, np.ndarray indices):
        data = np.ascontiguousarray(data)
        start_offsets = np.ascontiguousarray(offsets[indices], dtype=np.int64)
        end_offsets = np.ascontiguousarray(offsets[indices + 1], dtype=np.int64)
        cdef np.int64_t[::1] coffsets = start_offsets
        cdef np.int64_t[::1] csizes = end_offsets - start_offsets
        
        cdef np.int64_t i, total_size = 0
        for i in range(csizes.shape[0]):
            total_size += csizes[i]
        res = np.empty(total_size, dtype=data.dtype)
    
        cdef np.ndarray cdata = data
        cdef np.ndarray cres = res
        
        cdef np.int64_t itemsize = data.itemsize
        cdef np.int64_t res_offset = 0
        for i in range(csizes.shape[0]):
            memcpy(cres.data + res_offset * itemsize, 
                   cdata.data + coffsets[i] * itemsize, 
                   csizes[i] * itemsize)
            res_offset += csizes[i]
        
        return res
    
    @cython.boundscheck(False)
    @cython.wraparound(False)
    def concatenate1(np.int64_t[:] data, np.int64_t[:] offsets, np.int64_t[:] indices):
        cdef np.int64_t tlen = 0, pos = 0, ix = 0, ixs = indices.size, i = 0, j = 0
        
        for i in range(ixs):
            ix = indices[i]
            tlen += offsets[ix + 1] - offsets[ix]
            
        r = np.empty(tlen, dtype = np.int64)
        cdef np.int64_t[:] cr = r, cdata = data
    
        for i in range(ixs):
            ix = indices[i]
            for j in range(offsets[ix], offsets[ix + 1]):
                cr[pos] = cdata[j]
                pos += 1
        
        return r
        """
        
        srcb = src.encode('utf-8')
        
        import hashlib, os, glob, importlib
        srch = hashlib.sha256(srcb).hexdigest().upper()[:8]
    
        if len(glob.glob(f'cy{srch}*')) == 0:
            with open(f'cys{srch}.pyx', 'wb') as f:
                f.write(srcb)
    
            import sys
            from setuptools import setup, Extension
            from Cython.Build import cythonize
            import numpy as np
    
            sys.argv += ['build_ext', '--inplace']
            setup(
                ext_modules = cythonize(
                    Extension(f'cy{srch}', [f'cys{srch}.pyx']), language_level = 3, annotate = True,
                ),
                include_dirs = [np.get_include()],
            )
            del sys.argv[-2:]
    
        print('Cython module:', f'cy{srch}')
        return importlib.import_module(f'cy{srch}')
    
    cy_cats = compile_cy_cats()
    concatenate0, concatenate1 = cy_cats.concatenate0, cy_cats.concatenate1
    
    print('np.concatenate(): ', end = '', flush = True)
    tim = timer()
    for t in tim:
        with t:
            ref = np.concatenate([data[offsets[i] : offsets[i + 1]] for i in indices])
    tref = tim.mean()
    
    concatenate0(data, offsets, indices) # Maybe pre-heat
    print('cy_concatenate0(): ', end = '', flush = True)
    tim = timer()
    for t in tim:
        with t:
            adv0 = concatenate0(data, offsets, indices)
    tadv0 = tim.mean()
    assert np.array_equal(ref, adv0) # Check that our solution is correct
    
    print('speedup:', round(tref / tadv0, 3))
    
    concatenate1(data, offsets, indices) # Maybe pre-heat
    print('cy_concatenate1(): ', end = '', flush = True)
    tim = timer()
    for t in tim:
        with t:
            adv1 = concatenate1(data, offsets, indices)
    tadv1 = tim.mean()
    assert np.array_equal(ref, adv1) # Check that our solution is correct
    
    print('speedup:', round(tref / tadv1, 3))
    

    Output:

    First machine (Windows):

    Cython module: cy0BEBA0C8
    np.concatenate(): Timed best=3.184 ms, mean=3.263 +- 0.1 ms
    cy_concatenate0(): Timed best=119.767 us, mean=128.688 +- 10.7 us
    speedup: 25.354
    cy_concatenate1(): Timed best=86.525 us, mean=93.699 +- 20.5 us
    speedup: 34.821
    

    Second machine (Linux):

    Cython module: cy0BEBA0C8
    np.concatenate(): Timed best=1.630 ms, mean=2.215 +- 0.5 ms
    cy_concatenate0(): Timed best=21.839 us, mean=28.930 +- 8.4 us
    speedup: 76.578
    cy_concatenate1(): Timed best=11.447 us, mean=15.263 +- 5.1 us
    speedup: 145.151