I have a function that does the following:
A very straightforward way I started with was the following code using np.concatenate:
import numpy as np
from numba import njit

arr = np.ones(10000)

@njit(fastmath=True)
def EQ1(a):
    eq1 = 2*a[1:-1]
    eq2 = 4*a[0:2]
    sys = (eq1, eq2)
    sys = np.concatenate(sys)
    return sys
Then I decided to look for faster approaches, and oddly, a Numba-jitted function that writes each individual element into a pre-allocated array is significantly faster:
@njit(fastmath=True)
def EQ2(a):
    sys = np.empty_like(a)
    l = 0
    for i in range(1, len(a)-1):
        eq1 = 2*a[i]
        sys[l] = eq1
        l += 1
    for i in range(2):
        eq2 = 4*a[i]
        sys[l] = eq2
        l += 1
    return sys
%timeit EQ1(arr)
6.31 µs ± 9.72 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit EQ2(arr)
3.48 µs ± 2.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
But the second approach starts occupying a lot of space as the number of equations grows. So I'm wondering if there are any faster alternatives. I already experimented with append() and extend(), but the jitted for loop is still faster. Thanks.
You're not exactly asking about concatenation speed. Rather, to accomplish the task, you are asking "how many memory operations do I need?", with the goal of being lazy, of doing as little as possible.
We're doing essentially no computation here; it is entirely an exercise focused on moving chunks of memory around.
These operations have negligible cost and won't be considered further:
eq2 = 4 * a[0:2]
sys = (eq1, eq2)
We will take "writing a bunch of ones", arr = np.ones(10_000), as "one memory operation": an array's worth of writes across the bus to DRAM. It is my belief that this happens in a sensible way, with each cache line being written just once. So that is our timing unit.
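If you want to attach a rough number to that unit on your own box:
%timeit np.ones(10_000)   # 10_000 float64s, i.e. 80 kB of writes; the figure is hardware dependent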
eq1 = 2 * a[1:-1]
Here we read one unit, which thanks to caching seems to have zero cost in a run-many-times benching context. We trivially double each item, and send that to a temp var at a cost of one unit.
sys = np.concatenate(sys)
Now concatenate reads one (cached) unit at zero cost, and incurs the expense of an additional unit of writing to newly allocated storage. Total: two units.
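Outside the jitted function, plain NumPy makes that extra copy visible (using the question's arr; tmp and out are just illustrative names):
tmp = 2 * arr[1:-1]                       # first unit: write the doubled values to a temporary
out = np.concatenate((tmp, 4 * arr[:2]))  # second unit: copy everything into freshly allocated storage
np.shares_memory(tmp, out)                # False -- concatenate always copies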
The range(2) loop has negligible cost and won't be considered further. Similarly for the .empty_like() allocation.
for i in range(1, len(a) - 1):
    eq1 = 2 * a[i]
    sys[l] = eq1
    l += 1
Reading a from cache we get for free, pretty much. Storing double that to sys costs us one unit. Multiplying by two is not a bottleneck in this pipeline, relative to memory operations. Buffering the result in a register for a moment, rather than in DRAM for a while, is a big win.
So the temp vars of EQ1 cost two units of memory operations, while EQ2 costs us just one unit.
Moral of the story: "big" temp vars are costly, as they aren't cache friendly.
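If the concern is that the EQ2 style balloons in code size as the number of equations grows, one middle ground is to keep the vectorized expressions but aim each one straight at its slice of a pre-allocated result, so nothing is written twice. A sketch in plain NumPy (EQ3 is just an illustrative name; ufuncs accept an out= argument there, but whether out= works inside @njit depends on your Numba version, so treat this as a sketch rather than a drop-in):
def EQ3(a):
    n = len(a)
    sys = np.empty_like(a)
    np.multiply(a[1:-1], 2.0, out=sys[:n-2])  # write each block straight into its slot
    np.multiply(a[:2], 4.0, out=sys[n-2:])    # no big temporary, no concatenate copy
    return sys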