python, numpy, cython

Fastest way to concatenate slices of numpy array


I have a large number of small numpy arrays (groups) of different sizes, and I want to concatenate an arbitrary subset of these groups as fast as possible. The solution I originally came up with is to store these groups as an np.array of np.arrays and then access a subset of groups with fancy indexing:

groups = []
for i in range(100000):
    size = np.random.randint(3) + 1
    groups.append(np.random.randint(1000000, size=size))
groups = np.array(groups)  # dtype=np.object
indices = np.random.randint(len(groups), size=1000)

%%timeit
np.concatenate(groups[indices])
>>> 204 µs ± 395 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

However, this solution is inefficient in terms of memory consumption: the groups are small (2 elements on average), and I have to store a full numpy array structure for each group, which costs almost 100 bytes per group (too much for me).
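
A rough check of that per-group overhead (illustrative only; exact numbers depend on the NumPy version and platform):

import sys
import numpy as np

a = np.random.randint(1000000, size=2)
print(a.nbytes)          # the raw data: only a few bytes for 2 elements
print(sys.getsizeof(a))  # data plus roughly 100 bytes of ndarray object overhead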

To make the solution more memory-efficient, I've decided to concatenate all the groups and store the array boundaries in a separate offsets array:

data = np.concatenate(groups)
offsets = np.cumsum([0] + [len(group) for group in groups])
# ith group is data[offsets[i]: offsets[i + 1]]

But then concatenating a subset is not straightforward at all. Something like this:

%%timeit
np.concatenate([data[offsets[i]: offsets[i + 1]] for i in indices])
>>> 1.02 ms ± 44.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

runs about 5 times slower than the original solution. I think this is due to two things: first, iterating over a numpy array of indices (Python wraps each C int into an object), and second, Python creates a numpy array structure for every slice/index. I don't think the concatenation time can be reduced any further for this case in pure Python, so I've decided to write a Cython solution.
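
A rough check of that reasoning (illustrative only): timing just the slice-view construction, without the final concatenate, should show where most of the ~1 ms goes.

%%timeit
[data[offsets[i]: offsets[i + 1]] for i in indices]  # build 1000 small ndarray views, no copying yet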

%%cython
import numpy as np
ctypedef long long int64

def concatenate(data, offsets, indices):
    cdef int64[::] data_view = data
    cdef int64[::] indices_view = indices
    cdef int64[::] offsets_view = offsets
    
    size = np.sum(offsets[indices + 1]) - np.sum(offsets[indices])
    res = np.zeros(size, dtype=np.int64)
    cdef int64[::] res_view = res
    
    cdef int64 i, l = 0, r
    for i in indices_view:
        r = l + offsets_view[i + 1] - offsets_view[i]
        res_view[l: r] = data_view[offsets_view[i]: offsets_view[i + 1]]
        l = r
    return res

%%timeit
concatenate(data, offsets, indices)
>>> 277 µs ± 89.8 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This solution is faster than the previous one, but still a little slower than the original. The biggest problem, though, is that I don't know the data type in advance. I've used int64 in the example, but it could be any numeric type, e.g. float32. Because of that, I cannot use typed memory views as I did above. In theory, I only need to know the size of the type (4/8 bytes), and if I have pointers to the data and result arrays, I can use memcpy or something similar to copy the slices. But I don't know how to do that in Cython. Is there a way to do it?


Solution

  • Here's my pure numpy-only solution, the adv_concatenate() function. It gives a 15x-47x speedup (varies between machines) compared to regular np.concatenate().

    NOTE: There is also a second, faster solution after the code of the first one.

    For time measurement I used the pip module timerit; install it once with python -m pip install timerit. Two types of machines were used for the timings: the first is Windows-based (my home laptop, the same machine for all tests); the second is Linux-based (a different repl.it machine for every test, so speeds differ between tests but are consistent within a single run/test).

    The idea of the algorithm is to use numpy's cumulative sum function (.cumsum()):

    1. We create an array of 1s whose length equals the total length of the resulting concatenated data array. This array will hold the index of every data element to be fetched into the result.
    2. At the starting position of each block (small sub-array) inside this array, we change the value in such a way that, after running cumsum() over the whole array, these starting values are transformed into starting offsets into the data array. The remaining values stay 1.
    3. We apply .cumsum() to this array. Now every element holds the correct index of a data element to be fetched.
    4. We form the resulting fetched data array by simply indexing data with the index array built above (a tiny worked example follows this list).
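
    A tiny worked example of steps 1-4, with made-up toy data (illustrative only, not part of the benchmark):

    import numpy as np

    data    = np.array([10, 11, 12, 20, 21, 30])  # three groups: [10,11,12], [20,21], [30]
    offsets = np.array([0, 3, 5, 6])
    indices = np.array([2, 0])                    # fetch group 2, then group 0

    begs, ends = offsets[indices], offsets[indices + 1]  # [5 0], [6 3]
    lens  = ends - begs                                  # [1 3]
    clens = lens.cumsum()                                # [1 4]
    ix = np.ones(clens[-1], dtype = offsets.dtype)       # [1 1 1 1]      (step 1)
    ix[0] = begs[0]                                      # [5 1 1 1]      (step 2)
    ix[clens[:-1]] = begs[1:] - ends[:-1] + 1            # [5 -5 1 1]     (step 2)
    ix = ix.cumsum()                                     # [5 0 1 2]      (step 3)
    print(data[ix])                                      # [30 10 11 12]  (step 4)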

    This algorithm could potentially be boosted even more by pre-computing some values, like offsets[1:] - offsets[:-1] or offsets[:-1] + 1, and using them inside the adv_concatenate() function.

    Try it online!

    # Needs: python -m pip install numpy timerit
    from timerit import Timerit
    import numpy as np
    np.random.seed(0)
    Timerit._default_asciimode = True
    
    groups = []
    for i in range(100000):
        size = np.random.randint(3) + 1
        groups.append(np.random.randint(1000000, size = size))
    groups = np.array(groups)  # dtype=np.object
    indices = np.random.randint(len(groups), size = 1000)
    
    data = np.concatenate(groups)
    offsets = np.cumsum([0] + [len(group) for group in groups])
    
    timer = lambda: Timerit(num = 600, verbose = 1)
    
    print('np.concatenate(): ', end = '', flush = True)
    tim = timer()
    for t in tim:
        with t:
            ref = np.concatenate([data[offsets[i] : offsets[i + 1]] for i in indices])
    tref = tim.mean()
    
    def adv_concatenate(data, offsets, indices):
        begs, ends = offsets[indices], offsets[indices + 1]
        lens = ends - begs
        clens = lens.cumsum()
        ix = np.ones((clens[-1],), dtype = offsets.dtype)
        ix[0] = begs[0]
        ix[clens[:-1]] = begs[1:] - ends[:-1] + 1
        ix = ix.cumsum()
        return data[ix]
        
    print('adv_concatenate(): ', end = '', flush = True)
    tim = timer()
    for t in tim:
        with t:
            adv = adv_concatenate(data, offsets, indices)
    tadv = tim.mean()
    assert np.array_equal(ref, adv) # Check that our solution is correct
    
    print('speedup:', round(tref / tadv, 3))
    

    Outputs:

    On first machine (Windows):

    np.concatenate(): Timed best=3.129 ms, mean=3.225 +- 0.1 ms
    adv_concatenate(): Timed best=191.137 us, mean=208.012 +- 20.7 us
    speedup: 15.504
    

    On second machine (Linux):

    np.concatenate(): Timed best=1.666 ms, mean=2.314 +- 0.4 ms
    adv_concatenate(): Timed best=35.596 us, mean=48.680 +- 15.4 us
    speedup: 47.532
    

    The second solution is even faster than the first one, giving a 40x-150x speedup (varies between machines) compared to regular np.concatenate(). But the second solution uses the Numba JIT LLVM-based compiler, which needs to be installed via python -m pip install numba.

    Although it uses the extra numba package, the central function adv_concatenate_indexes_numba() is still very simple, with about the same number of lines of code as in the first solution. The algorithm is also much simpler: just two plain loops.

    The current solution works for any data type, because the central function only computes the resulting indexes and hence never touches data's dtype at all. The solution can also be boosted by another 10%-90% if, instead of computing indexes, the numba function computes the resulting data array directly, but that is only possible for the fairly simple data types supported by numba, including all number types. Here is the code for this improved solution, which achieves up to a 250x speedup! Timings for this improved version on the second machine (Linux):

    np.concatenate(): Timed best=1.640 ms, mean=3.403 +- 1.9 ms
    adv_concatenate_numba(): Timed best=12.669 us, mean=17.235 +- 6.9 us
    speedup: 197.46
    
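
    A minimal sketch of the idea behind that improved version (not the exact linked code; it assumes int64 data like the rest of the examples, and the name adv_concatenate_data_numba() is just illustrative):

    import numpy as np, numba

    @numba.njit('i8[:](i8[:], i8[:], i8[:])', cache = True)
    def adv_concatenate_data_numba(data, offsets, indices):
        # First pass: compute the total length of the result
        tlen = 0
        for i in range(indices.size):
            ix = indices[i]
            tlen += offsets[ix + 1] - offsets[ix]

        # Second pass: write the result elements directly, no index array needed
        pos, res = 0, np.empty((tlen,), dtype = np.int64)
        for i in range(indices.size):
            ix = indices[i]
            for j in range(offsets[ix], offsets[ix + 1]):
                res[pos] = data[j]  # copy the element itself, not its index
                pos += 1

        return res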

    The code of the more generic (indexes-only) solution comes next:

    Try it online!

    # Needs: python -m pip install numpy numba timerit
    from timerit import Timerit
    import numpy as np, numba
    np.random.seed(0)
    Timerit._default_asciimode = True
    
    groups = []
    for i in range(100000):
        size = np.random.randint(3) + 1
        groups.append(np.random.randint(1000000, size = size, dtype = np.int64))
    groups = np.array(groups)  # dtype=np.object
    indices = np.random.randint(len(groups), size = 1000, dtype = np.int64)
    
    data = np.concatenate(groups)
    offsets = np.cumsum([0] + [len(group) for group in groups], dtype = np.int64)
    
    timer = lambda: Timerit(num = 600, verbose = 1)
    
    print('np.concatenate(): ', end = '', flush = True)
    tim = timer()
    for t in tim:
        with t:
            ref = np.concatenate([data[offsets[i] : offsets[i + 1]] for i in indices])
    tref = tim.mean()
    
    @numba.njit('i8[:](i8[:], i8[:])', cache = True)
    def adv_concatenate_indexes_numba(offsets, indices):
        tlen = 0
        for i in range(indices.size):
            ix = indices[i]
            tlen += offsets[ix + 1] - offsets[ix]
            
        pos, r = 0, np.empty((tlen,), dtype = offsets.dtype)
        for i in range(indices.size):
            ix = indices[i]
            for j in range(offsets[ix], offsets[ix + 1]):
                r[pos] = j
                pos += 1
                
        return r
        
    def adv_concatenate2(data, offsets, indices):
        return data[adv_concatenate_indexes_numba(offsets, indices)]
        
    adv_concatenate2(data, offsets, indices) # Once pre-compile Numba
    print('adv_concatenate2(): ', end = '', flush = True)
    tim = timer()
    for t in tim:
        with t:
            adv = adv_concatenate2(data, offsets, indices)
    tadv = tim.mean()
    assert np.array_equal(ref, adv) # Check that our solution is correct
    
    print('speedup:', round(tref / tadv, 3))
    

    Output:

    On first machine (Windows):

    np.concatenate(): Timed best=3.201 ms, mean=3.356 +- 0.1 ms
    adv_concatenate2(): Timed best=79.681 us, mean=82.991 +- 6.7 us
    speedup: 40.442
    

    On second machine (Linux):

    np.concatenate(): Timed best=1.541 ms, mean=2.220 +- 0.7 ms
    adv_concatenate2(): Timed best=12.012 us, mean=14.830 +- 4.8 us
    speedup: 149.716
    

    Inspired by the Cython code in @pavelgramovich's answer, I've also decided to implement a simplified version with a plain loop (function concatenate1()) in addition to the memcpy() version (function concatenate0()); the simplified version turned out to be 1.5-2x faster than the memcpy version on the current test data. The full code comparing both versions is below:

    Try it online!

    # Needs: python -m pip install numpy timerit cython setuptools
    from timerit import Timerit
    import numpy as np
    np.random.seed(0)
    Timerit._default_asciimode = True
    
    groups = []
    for i in range(100000):
        size = np.random.randint(3) + 1
        groups.append(np.random.randint(1000000, size = size, dtype = np.int64))
    groups = np.array(groups)  # dtype=np.object
    indices = np.random.randint(len(groups), size = 1000, dtype = np.int64)
    
    data = np.concatenate(groups)
    offsets = np.cumsum([0] + [len(group) for group in groups], dtype = np.int64)
    
    timer = lambda: Timerit(num = 600, verbose = 1)
    
    def compile_cy_cats():
        src = """
    import numpy as np
    cimport numpy as np
    cimport cython
    from libc.string cimport memcpy 
    
    @cython.boundscheck(False)
    @cython.wraparound(False)
    def concatenate0(np.ndarray data, np.ndarray offsets, np.ndarray indices):
        data = np.ascontiguousarray(data)
        start_offsets = np.ascontiguousarray(offsets[indices], dtype=np.int64)
        end_offsets = np.ascontiguousarray(offsets[indices + 1], dtype=np.int64)
        cdef np.int64_t[::1] coffsets = start_offsets
        cdef np.int64_t[::1] csizes = end_offsets - start_offsets
        
        cdef np.int64_t i, total_size = 0
        for i in range(csizes.shape[0]):
            total_size += csizes[i]
        res = np.empty(total_size, dtype=data.dtype)
    
        cdef np.ndarray cdata = data
        cdef np.ndarray cres = res
        
        cdef np.int64_t itemsize = data.itemsize
        cdef np.int64_t res_offset = 0
        for i in range(csizes.shape[0]):
            memcpy(cres.data + res_offset * itemsize, 
                   cdata.data + coffsets[i] * itemsize, 
                   csizes[i] * itemsize)
            res_offset += csizes[i]
        
        return res
    
    @cython.boundscheck(False)
    @cython.wraparound(False)
    def concatenate1(np.int64_t[:] data, np.int64_t[:] offsets, np.int64_t[:] indices):
        cdef np.int64_t tlen = 0, pos = 0, ix = 0, ixs = indices.size, i = 0, j = 0
        
        for i in range(ixs):
            ix = indices[i]
            tlen += offsets[ix + 1] - offsets[ix]
            
        r = np.empty(tlen, dtype = np.int64)
        cdef np.int64_t[:] cr = r, cdata = data
    
        for i in range(ixs):
            ix = indices[i]
            for j in range(offsets[ix], offsets[ix + 1]):
                cr[pos] = cdata[j]
                pos += 1
        
        return r
        """
        
        srcb = src.encode('utf-8')
        
        import hashlib, os, glob, importlib
        srch = hashlib.sha256(srcb).hexdigest().upper()[:8]
    
        if len(glob.glob(f'cy{srch}*')) == 0:
            with open(f'cys{srch}.pyx', 'wb') as f:
                f.write(srcb)
    
            import sys
            from setuptools import setup, Extension
            from Cython.Build import cythonize
            import numpy as np
    
            sys.argv += ['build_ext', '--inplace']
            setup(
                ext_modules = cythonize(
                    Extension(f'cy{srch}', [f'cys{srch}.pyx']), language_level = 3, annotate = True,
                ),
                include_dirs = [np.get_include()],
            )
            del sys.argv[-2:]
    
        print('Cython module:', f'cy{srch}')
        return importlib.import_module(f'cy{srch}')
    
    cy_cats = compile_cy_cats()
    concatenate0, concatenate1 = cy_cats.concatenate0, cy_cats.concatenate1
    
    print('np.concatenate(): ', end = '', flush = True)
    tim = timer()
    for t in tim:
        with t:
            ref = np.concatenate([data[offsets[i] : offsets[i + 1]] for i in indices])
    tref = tim.mean()
    
    concatenate0(data, offsets, indices) # Maybe pre-heat
    print('cy_concatenate0(): ', end = '', flush = True)
    tim = timer()
    for t in tim:
        with t:
            adv0 = concatenate0(data, offsets, indices)
    tadv0 = tim.mean()
    assert np.array_equal(ref, adv0) # Check that our solution is correct
    
    print('speedup:', round(tref / tadv0, 3))
    
    concatenate1(data, offsets, indices) # Maybe pre-heat
    print('cy_concatenate1(): ', end = '', flush = True)
    tim = timer()
    for t in tim:
        with t:
            adv1 = concatenate1(data, offsets, indices)
    tadv1 = tim.mean()
    assert np.array_equal(ref, adv1) # Check that our solution is correct
    
    print('speedup:', round(tref / tadv1, 3))
    

    Output:

    First machine (Windows):

    Cython module: cy0BEBA0C8
    np.concatenate(): Timed best=3.184 ms, mean=3.263 +- 0.1 ms
    cy_concatenate0(): Timed best=119.767 us, mean=128.688 +- 10.7 us
    speedup: 25.354
    cy_concatenate1(): Timed best=86.525 us, mean=93.699 +- 20.5 us
    speedup: 34.821
    

    Second machine (Linux):

    Cython module: cy0BEBA0C8
    np.concatenate(): Timed best=1.630 ms, mean=2.215 +- 0.5 ms
    cy_concatenate0(): Timed best=21.839 us, mean=28.930 +- 8.4 us
    speedup: 76.578
    cy_concatenate1(): Timed best=11.447 us, mean=15.263 +- 5.1 us
    speedup: 145.151