I have a large number of small numpy arrays (groups) of different sizes, and I want to concatenate an arbitrary subset of these groups as fast as possible. The solution I originally came up with is to store these groups as an np.array of np.arrays and then access a subset of groups with fancy indexing:
import numpy as np

groups = []
for i in range(100000):
    size = np.random.randint(3) + 1
    groups.append(np.random.randint(1000000, size=size))
groups = np.array(groups)  # dtype=np.object
indices = np.random.randint(len(groups), size=1000)

%%timeit
np.concatenate(groups[indices])
>>> 204 µs ± 395 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
However, this solution is inefficient in terms of memory consumption: the groups are small (2 elements on average), yet I have to store a full numpy array structure for each group, which is almost 100 bytes per group (too much for me).
To make the solution more memory-efficient, I've decided to concatenate all groups and store the array boundaries in a separate offsets array:
data = np.concatenate(groups)
offsets = np.cumsum([0] + [len(group) for group in groups])
# ith group is data[offsets[i]: offsets[i + 1]]
But then concatenating a subset is no longer straightforward. Something like this:
%%timeit
np.concatenate([data[offsets[i]: offsets[i + 1]] for i in indices])
>>> 1.02 ms ± 44.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This works 5 times slower than the original solution. I think this is because of two things: first, iterating over the numpy array of indices (Python wraps a C int into an object for each index); second, Python creates a numpy array structure for every slice. I don't think the concatenation time can be reduced much further in pure Python, so I've decided to come up with a Cython solution.
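(As an aside, here is a tweak I sketch only for illustration, not something I benchmarked: converting the offsets to plain Python ints up front removes the first overhead, but each slice still creates a new numpy view.)

starts = offsets[indices].tolist()      # plain Python ints, no numpy scalar per index
stops  = offsets[indices + 1].tolist()
res = np.concatenate([data[l: r] for l, r in zip(starts, stops)])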
%%cython
import numpy as np

ctypedef long long int64

def concatenate(data, offsets, indices):
    cdef int64[:] data_view = data
    cdef int64[:] indices_view = indices
    cdef int64[:] offsets_view = offsets

    size = np.sum(offsets[indices + 1]) - np.sum(offsets[indices])
    res = np.zeros(size, dtype=np.int64)
    cdef int64[:] res_view = res

    cdef int64 i, l = 0, r
    for i in indices_view:
        r = l + offsets_view[i + 1] - offsets_view[i]
        res_view[l: r] = data_view[offsets_view[i]: offsets_view[i + 1]]
        l = r

    return res
%%timeit
concatenate(data, offsets, indices)
>>> 277 µs ± 89.8 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This solution is faster than the previous one, but still a little slower than the original. The biggest problem, though, is that I don't know the data type in advance. I've used int64 in the example, but it could be any numeric type, e.g. float32. Because of that, I cannot use typed memory views as I did. In theory, I only need to know the size of the type (4/8 bytes), and if I have pointers to the data and result arrays, I can use memcpy or something similar to copy the slices. But I don't know how to do that in Cython. Is there a way to do it?
Here's my pure numpy-only solution, the adv_concatenate() function. It gives a 15x-47x speedup (depending on the machine) compared to regular np.concatenate().

NOTE: There is also a second, faster solution after the code of the first one.

For time measurement I used the pip module timerit; install it once with python -m pip install timerit. Two types of machines were used for the timings: the first is Windows-based (my home laptop, the same machine for all tests); the second is Linux-based (a repl.it machine, a different one for each test, so speeds differ between tests but are consistent within a single run).
The idea of the algorithm is to build an array of indexes into data using numpy's cumulative sum function (.cumsum()):

1. Create an array of 1s whose size equals the total size of the resulting concatenated data array. This array will hold the indexes of all data elements to be fetched to create the resulting data array.
2. At the position where each selected group starts, replace the 1 with a value chosen so that, after running .cumsum() on the whole array, these starting values are transformed into starting offsets into the data array (the group's start offset for the very first group, and the group's start offset minus the previous group's end offset plus 1 for every following group). The remaining values remain 1s.
3. Run .cumsum(). Now all values hold the correct indexes of the data elements to be fetched.
4. Index data with the array of indexes formed above.

This algorithm could potentially be boosted even more if we pre-compute some values like offsets[1:] - offsets[:-1] or offsets[:-1] + 1 and use them inside the adv_concatenate() function.
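To make the index construction concrete, here is a tiny worked example I've added for illustration (hypothetical data, not part of the benchmark): three groups of sizes 3, 2 and 3, from which we fetch group 2 followed by group 0.

import numpy as np

data    = np.array([10, 11, 12, 20, 21, 30, 31, 32])  # groups 0, 1, 2 stored back to back
offsets = np.array([0, 3, 5, 8])                       # group i is data[offsets[i] : offsets[i + 1]]
indices = np.array([2, 0])                             # fetch group 2, then group 0

begs, ends = offsets[indices], offsets[indices + 1]    # begs = [5 0], ends = [8 3]
lens  = ends - begs                                    # lens  = [3 3]
clens = lens.cumsum()                                  # clens = [3 6], result positions where groups end

ix = np.ones((clens[-1],), dtype = offsets.dtype)      # ix = [1 1 1 1 1 1]
ix[0] = begs[0]                                        # ix = [5 1 1 1 1 1]
ix[clens[:-1]] = begs[1:] - ends[:-1] + 1              # ix = [5 1 1 -7 1 1], jump from end of group 2 to start of group 0
ix = ix.cumsum()                                       # ix = [5 6 7 0 1 2]

print(data[ix])                                        # [30 31 32 10 11 12], i.e. group 2 followed by group 0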
# Needs: python -m pip install numpy timerit
from timerit import Timerit
import numpy as np
np.random.seed(0)
Timerit._default_asciimode = True

groups = []
for i in range(100000):
    size = np.random.randint(3) + 1
    groups.append(np.random.randint(1000000, size = size))
groups = np.array(groups) # dtype=np.object
indices = np.random.randint(len(groups), size = 1000)
data = np.concatenate(groups)
offsets = np.cumsum([0] + [len(group) for group in groups])
timer = lambda: Timerit(num = 600, verbose = 1)

print('np.concatenate(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        ref = np.concatenate([data[offsets[i] : offsets[i + 1]] for i in indices])
tref = tim.mean()

def adv_concatenate(data, offsets, indices):
    begs, ends = offsets[indices], offsets[indices + 1]
    lens = ends - begs
    clens = lens.cumsum()
    ix = np.ones((clens[-1],), dtype = offsets.dtype)
    ix[0] = begs[0]
    ix[clens[:-1]] = begs[1:] - ends[:-1] + 1
    ix = ix.cumsum()
    return data[ix]

print('adv_concatenate(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        adv = adv_concatenate(data, offsets, indices)
tadv = tim.mean()

assert np.array_equal(ref, adv) # Check that our solution is correct

print('speedup:', round(tref / tadv, 3))
Outputs:
On the first machine (Windows):
np.concatenate(): Timed best=3.129 ms, mean=3.225 +- 0.1 ms
adv_concatenate(): Timed best=191.137 us, mean=208.012 +- 20.7 us
speedup: 15.504
On the second machine (Linux):
np.concatenate(): Timed best=1.666 ms, mean=2.314 +- 0.4 ms
adv_concatenate(): Timed best=35.596 us, mean=48.680 +- 15.4 us
speedup: 47.532
The second solution is even faster than the first one, giving a 40x-150x speedup (depending on the machine) compared to regular np.concatenate(). But it uses the Numba JIT LLVM-based compiler, which needs to be installed via python -m pip install numba.

Although it needs the extra numba package, the central function adv_concatenate_indexes_numba() is very simple, with the same number of lines of code as in the first solution. The algorithm is also much simpler: just two plain loops.

The current solution works for any data type, because the central function only computes the resulting indexes and hence doesn't touch data's dtype at all. It can be boosted by 10%-90% more if, instead of computing indexes, the numba function builds the resulting data array directly, but this is only possible for the fairly simple data types supported by numba, including all number types. Here is the code (or here) for this improved solution, which achieves up to a 250x speedup; a rough sketch of such a direct-copy variant is shown after the timings below. Timings for this improved version on the second machine (Linux):
np.concatenate(): Timed best=1.640 ms, mean=3.403 +- 1.9 ms
adv_concatenate_numba(): Timed best=12.669 us, mean=17.235 +- 6.9 us
speedup: 197.46
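The linked improved code is not reproduced here, so the following is only my rough sketch of what such a direct-copy variant could look like, assuming int64 data (other numeric dtypes would need their own compiled signature). The JIT-compiled function fills the result array itself instead of returning indexes:

# Rough sketch of an assumed direct-copy variant, tied to int64 by its signature
import numpy as np, numba

@numba.njit('i8[:](i8[:], i8[:], i8[:])', cache = True)
def adv_concatenate_numba(data, offsets, indices):
    tlen = 0
    for i in range(indices.size):
        ix = indices[i]
        tlen += offsets[ix + 1] - offsets[ix]     # total length of the result
    pos, r = 0, np.empty((tlen,), dtype = np.int64)
    for i in range(indices.size):
        ix = indices[i]
        for j in range(offsets[ix], offsets[ix + 1]):
            r[pos] = data[j]                      # copy elements directly, no index array
            pos += 1
    return r

In the benchmark script below it could be timed in place of adv_concatenate2(), with one warm-up call first to trigger compilation.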
The code for the more generic (indexes-only) solution is next:
# Needs: python -m pip install numpy numba timerit
from timerit import Timerit
import numpy as np, numba
np.random.seed(0)
Timerit._default_asciimode = True

groups = []
for i in range(100000):
    size = np.random.randint(3) + 1
    groups.append(np.random.randint(1000000, size = size, dtype = np.int64))
groups = np.array(groups) # dtype=np.object
indices = np.random.randint(len(groups), size = 1000, dtype = np.int64)
data = np.concatenate(groups)
offsets = np.cumsum([0] + [len(group) for group in groups], dtype = np.int64)
timer = lambda: Timerit(num = 600, verbose = 1)

print('np.concatenate(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        ref = np.concatenate([data[offsets[i] : offsets[i + 1]] for i in indices])
tref = tim.mean()

@numba.njit('i8[:](i8[:], i8[:])', cache = True)
def adv_concatenate_indexes_numba(offsets, indices):
    tlen = 0
    for i in range(indices.size):
        ix = indices[i]
        tlen += offsets[ix + 1] - offsets[ix]
    pos, r = 0, np.empty((tlen,), dtype = offsets.dtype)
    for i in range(indices.size):
        ix = indices[i]
        for j in range(offsets[ix], offsets[ix + 1]):
            r[pos] = j
            pos += 1
    return r

def adv_concatenate2(data, offsets, indices):
    return data[adv_concatenate_indexes_numba(offsets, indices)]

adv_concatenate2(data, offsets, indices) # Once pre-compile Numba

print('adv_concatenate2(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        adv = adv_concatenate2(data, offsets, indices)
tadv = tim.mean()

assert np.array_equal(ref, adv) # Check that our solution is correct

print('speedup:', round(tref / tadv, 3))
Output:
On the first machine (Windows):
np.concatenate(): Timed best=3.201 ms, mean=3.356 +- 0.1 ms
adv_concatenate2(): Timed best=79.681 us, mean=82.991 +- 6.7 us
speedup: 40.442
On the second machine (Linux):
np.concatenate(): Timed best=1.541 ms, mean=2.220 +- 0.7 ms
adv_concatenate2(): Timed best=12.012 us, mean=14.830 +- 4.8 us
speedup: 149.716
Inspired by the Cython code in @pavelgramovich's answer, I also decided to implement a simplified version with a plain copy loop (the concatenate1() function) instead of the memcpy() version (the concatenate0() function); the simplified version turned out to be 1.5-2x faster than the memcpy version for the current test data. The full code comparing both versions is below:
# Needs: python -m pip install numpy timerit cython setuptools
from timerit import Timerit
import numpy as np
np.random.seed(0)
Timerit._default_asciimode = True

groups = []
for i in range(100000):
    size = np.random.randint(3) + 1
    groups.append(np.random.randint(1000000, size = size, dtype = np.int64))
groups = np.array(groups) # dtype=np.object
indices = np.random.randint(len(groups), size = 1000, dtype = np.int64)
data = np.concatenate(groups)
offsets = np.cumsum([0] + [len(group) for group in groups], dtype = np.int64)
timer = lambda: Timerit(num = 600, verbose = 1)

def compile_cy_cats():
    src = """
import numpy as np
cimport numpy as np
cimport cython
from libc.string cimport memcpy

@cython.boundscheck(False)
@cython.wraparound(False)
def concatenate0(np.ndarray data, np.ndarray offsets, np.ndarray indices):
    data = np.ascontiguousarray(data)
    start_offsets = np.ascontiguousarray(offsets[indices], dtype=np.int64)
    end_offsets = np.ascontiguousarray(offsets[indices + 1], dtype=np.int64)
    cdef np.int64_t[::1] coffsets = start_offsets
    cdef np.int64_t[::1] csizes = end_offsets - start_offsets

    cdef np.int64_t i, total_size = 0
    for i in range(csizes.shape[0]):
        total_size += csizes[i]
    res = np.empty(total_size, dtype=data.dtype)

    cdef np.ndarray cdata = data
    cdef np.ndarray cres = res

    cdef np.int64_t itemsize = data.itemsize
    cdef np.int64_t res_offset = 0
    for i in range(csizes.shape[0]):
        memcpy(cres.data + res_offset * itemsize,
               cdata.data + coffsets[i] * itemsize,
               csizes[i] * itemsize)
        res_offset += csizes[i]

    return res

@cython.boundscheck(False)
@cython.wraparound(False)
def concatenate1(np.int64_t[:] data, np.int64_t[:] offsets, np.int64_t[:] indices):
    cdef np.int64_t tlen = 0, pos = 0, ix = 0, ixs = indices.size, i = 0, j = 0

    for i in range(ixs):
        ix = indices[i]
        tlen += offsets[ix + 1] - offsets[ix]

    r = np.empty(tlen, dtype = np.int64)
    cdef np.int64_t[:] cr = r, cdata = data

    for i in range(ixs):
        ix = indices[i]
        for j in range(offsets[ix], offsets[ix + 1]):
            cr[pos] = cdata[j]
            pos += 1

    return r
"""

    srcb = src.encode('utf-8')
    import hashlib, os, glob, importlib
    srch = hashlib.sha256(srcb).hexdigest().upper()[:8]

    if len(glob.glob(f'cy{srch}*')) == 0:
        with open(f'cys{srch}.pyx', 'wb') as f:
            f.write(srcb)
        import sys
        from setuptools import setup, Extension
        from Cython.Build import cythonize
        import numpy as np
        sys.argv += ['build_ext', '--inplace']
        setup(
            ext_modules = cythonize(
                Extension(f'cy{srch}', [f'cys{srch}.pyx']), language_level = 3, annotate = True,
            ),
            include_dirs = [np.get_include()],
        )
        del sys.argv[-2:]

    print('Cython module:', f'cy{srch}')
    return importlib.import_module(f'cy{srch}')

cy_cats = compile_cy_cats()
concatenate0, concatenate1 = cy_cats.concatenate0, cy_cats.concatenate1
print('np.concatenate(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        ref = np.concatenate([data[offsets[i] : offsets[i + 1]] for i in indices])
tref = tim.mean()

concatenate0(data, offsets, indices) # Maybe pre-heat

print('cy_concatenate0(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        adv0 = concatenate0(data, offsets, indices)
tadv0 = tim.mean()

assert np.array_equal(ref, adv0) # Check that our solution is correct

print('speedup:', round(tref / tadv0, 3))

concatenate1(data, offsets, indices) # Maybe pre-heat

print('cy_concatenate1(): ', end = '', flush = True)
tim = timer()
for t in tim:
    with t:
        adv1 = concatenate1(data, offsets, indices)
tadv1 = tim.mean()

assert np.array_equal(ref, adv1) # Check that our solution is correct

print('speedup:', round(tref / tadv1, 3))
Output:
First machine (Windows):
Cython module: cy0BEBA0C8
np.concatenate(): Timed best=3.184 ms, mean=3.263 +- 0.1 ms
cy_concatenate0(): Timed best=119.767 us, mean=128.688 +- 10.7 us
speedup: 25.354
cy_concatenate1(): Timed best=86.525 us, mean=93.699 +- 20.5 us
speedup: 34.821
Second machine (Linux):
Cython module: cy0BEBA0C8
np.concatenate(): Timed best=1.630 ms, mean=2.215 +- 0.5 ms
cy_concatenate0(): Timed best=21.839 us, mean=28.930 +- 8.4 us
speedup: 76.578
cy_concatenate1(): Timed best=11.447 us, mean=15.263 +- 5.1 us
speedup: 145.151