Tags: python, performance, vectorization, numpy-ndarray, numba

# performance improvement on numpy ndarray columnar calculation (row reduction)

I'm doing row reduction on a 3-dimensional ndarray (KxMxN): for each column, I take all of its values and apply a reduce function to produce a scalar, so a KxMxN array eventually becomes a 2-dimensional ndarray of shape KxN.

What I am trying to solve:

1. Take a 3-D matrix and split it into 16 sub-matrices (or any even number of splits larger than 2; 16 is the optimal one);
2. Generate all combinations that select half of the sub-matrices (8 out of 16) and join each selection of 8 sub-matrices into one matrix, so there is one matrix per combination;
3. For each combination, compute a scalar from the reduce function for each column of the new matrix, ideally all at once.

There are more implementation details, which I'll explain along the way.

The 3-D ndarray contains floats.

In the following example, `njit` with `numpy` is the best I could get so far. I am wondering if there's any room for further improvement, from any angle.

`cupy` (GPU parallelization), `dask` (CPU parallelization) and `numba` parallelization all failed to beat the following (my use case is apparently too small to leverage the power of the GPU, and I've only got 8 GB of GPU memory). There's a good chance these tools can be used in a much more advanced way that I don't know about.

```python
from numba import njit, guvectorize, float64, int64
from math import sqrt
import numba as nb
import numpy as np
import itertools

# Create a 2D ndarray
m = np.random.rand(800, 100)
# Reshape it into a list of sub-matrices
mr = m.reshape(16, 50, 100)

# Create an indices matrix from combinatorics.
# A typical one for me is "select 8 from 16", i.e. 12870 combinations.
# I do have a custom combination generator, but this is not what I wanted
# to optimise, and itertools has done a decent job already.
x = np.array(list(itertools.combinations(np.arange(16), 8)))

# Now we select 8 sub-matrices from `mr` and reshape them into one bigger
# matrix; we do this in a list comprehension.
# This is the matrix we are going to reduce.

# Bottleneck 1: this line takes the longest and I'd hope to improve on it,
# but I am not sure there's much we can do here.
m3d = np.array([mr[idx_arr].reshape(400, 100) for idx_arr in x])

# We create different versions of the same reduce function.

# Bottleneck 2: the reduce function is another place I'd like to improve.

# col  - column values
# days - trading days in a year
# rf   - risk-free rate

# njit version with the ndarray methods `mean`, `std`, and Python's `sqrt`
@njit
def nb_sr(col, days, rf):
    mean = (col.mean() * days) - rf
    std = col.std() * sqrt(days)
    return mean / std

# njit version with numpy functions
@njit
def nb_sr_np(col, days, rf):
    mean = (np.mean(col) * days) - rf
    std = np.std(col) * np.sqrt(days)
    return mean / std

# guvectorize with numpy
@guvectorize([(float64[:], int64, float64, float64[:])], '(n),(),()->()', nopython=True)
def gu_sr_np(col, days, rf, res):
    mean = (np.mean(col) * days) - rf
    std = np.std(col) * np.sqrt(days)
    res[0] = mean / std

# We wrap them such that they can be applied to a 2-D matrix with a list
# comprehension.

# Bottleneck 3: I was thinking of vectorizing this wrapper, but the closest
# I can get is a list comprehension, which isn't really vectorization.

def nb_sr_wrapper(m2d):
    return [nb_sr(r, 252, .25) for r in m2d.T]

def nb_sr_np_wrapper(m2d):
    return [nb_sr_np(r, 252, .25) for r in m2d.T]

def gu_sr_np_wrapper(m2d):
    return [gu_sr_np(r, 252, .25) for r in m2d.T]

# Finally! Here's our performance benchmarking step.

%timeit np.array([nb_sr_wrapper(m) for m in m3d.T])
# output: 4.26 s ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.array([nb_sr_np_wrapper(m) for m in m3d.T])
# output: 4.33 s ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.array([gu_sr_np_wrapper(m) for m in m3d.T])
# output: 6.06 s ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

## Solution

- **TL;DR:** The final version below doesn't need `m3d`. Instead it uses `mr` and `x` directly. That achieves an 88x speedup for steps 1, 2 and 3 (or a 54x speedup over steps 2 and 3 only). The timing of that final solution is 78 ms.

Bottleneck 1 is relatively easy to address:

```python
a, b = x.shape
_, _, N = mr.shape

new_m3d = mr[x].reshape(a, -1, N)

>>> np.array_equal(m3d, new_m3d)
True
```

It's about 3.5x faster on the dimensions provided (`16, 50, 100` for `mr` and `12870, 8` for `x`).
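To see why the fancy-indexing version is equivalent, here is a small sketch with made-up sizes (6 sub-matrices of shape `(5, 4)`, choosing 3 per combination, rather than the OP's dimensions), showing the shapes involved:

```python
import itertools
import numpy as np

# Small illustrative sizes: 6 sub-matrices of shape (5, 4), choose 3
mr = np.random.rand(6, 5, 4)
x = np.array(list(itertools.combinations(range(6), 3)))  # shape (20, 3)

# Indexing with a 2-D integer array prepends x's shape to the result:
sel = mr[x]                                       # shape (20, 3, 5, 4)
new_m3d = sel.reshape(len(x), -1, mr.shape[2])    # shape (20, 15, 4)

# Same result as the original list-comprehension construction
m3d = np.array([mr[i].reshape(-1, mr.shape[2]) for i in x])
assert np.array_equal(m3d, new_m3d)
```

The speedup comes from doing the gather-and-concatenate in a single vectorized indexing operation instead of one Python iteration per combination.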

For bottlenecks #2 and #3, the functionality can be expressed more simply. We slightly refactor the quantity computed by `nb_sr()`:

    sr(v) = (mean(v) * days - rf) / (std(v) * sqrt(days))
          = ((mean(v) - rf/days) / std(v)) * sqrt(days)

where `v` is one "column" of `m3d`, e.g. `m3d[0, :, 0]`.

We then use `numpy` to do the `mean` and the `std` on the entire `m3d` in one shot, and we obtain the simpler (and faster) expression:

```python
days, rf = 252, .25
out = np.array([nb_sr_wrapper(m) for m in m3d.T])  # original result, for comparison

out2 = ((np.mean(m3d, axis=1) - rf/days) / np.std(m3d, axis=1)).T * sqrt(days)

>>> np.allclose(out, out2)
True
```

In terms of timing, this is about 2x faster than `np.array([nb_sr_wrapper(m) for m in m3d.T])`, as all looping is done inside `numpy` (no Python loops, not even in list comprehension).
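As a sanity check (with small, made-up sizes rather than the OP's), the single-shot `axis=1` reduction agrees with a plain per-column Python loop applying the original Sharpe-ratio formula:

```python
import numpy as np
from math import sqrt

days, rf = 252, 0.25
m3d = np.random.rand(7, 20, 5)   # small stand-in for the real (12870, 400, 100)

# Vectorized: reduce along axis=1 (the rows of each combination matrix)
out2 = ((np.mean(m3d, axis=1) - rf / days) / np.std(m3d, axis=1)).T * sqrt(days)

# Looped reference, one column at a time, using the original formula
ref = np.array([[(c.mean() * days - rf) / (c.std() * sqrt(days))
                 for c in m.T] for m in m3d])
assert np.allclose(out2, ref.T)   # note how the transposes line up
```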

### Breaking the standard deviation into parts

However, we observe that most of the time in the `out2` calculation is spent in `np.std(m3d, axis=1)`. Let us try to speed that up. When thinking about that operation and how `m3d` was created in the first place (each `v` being a concatenation of several slices of `mr` according to the combinations in `x`), we see that there are lots of repeated calculations. For example, with the size parameters of the OP, we have:

```python
>>> x[0]
array([0, 1, 2, 3, 4, 5, 6, 7])

>>> x[1]
array([0, 1, 2, 3, 4, 5, 6, 8])
```

Each `x[i]` denotes 8 slices of `mr` that end up concatenated in `m3d` (with the OP's sizes, each `x[i]` has 8 elements, each `mr[x[i]]` has shape `(8, 50, 100)` and is concatenated into `m3d[i]` with shape `(400, 100)`; `np.std(m3d, axis=1)` computes the standard deviation of each of the `12870 * 100` vectors `v` of size `400`). The goal is therefore to express `std(v)` in terms of the constituents of `v`, i.e. the 8 slices of `mr` indexed by `x`.

The standard deviation (with `p` degrees of freedom) is:

    std(v) = sqrt( (sum(v_i**2) - (sum(v_i))**2 / n) / (n - p) )

which, for `p = 0`, is `sqrt(sum(v_i**2)/n - (sum(v_i)/n)**2)`. And now we can express each of these sums from the partial sums of the constituents, which we compute only once. Our new `std` calculation doesn't use `m3d`, only `mr` and `x`:

```python
def optstd(mr, x, ddof=0):
    n = mr.shape[1] * x.shape[1] - ddof
    m2 = np.sum(mr**2, axis=1) / n
    m = np.sum(mr, axis=1) / n
    return np.sqrt(m2[x].sum(1) - m[x].sum(1)**2)

out3 = ((np.mean(m3d, axis=1) - rf/days) / optstd(mr, x)).T * sqrt(days)

>>> np.allclose(out, out3)
True
```
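The decomposition can also be checked directly against `np.std`, independently of the Sharpe-ratio context; here is a sketch with small, made-up sizes:

```python
import itertools
import numpy as np

# Small illustrative sizes: 6 sub-matrices of shape (5, 4), choose 3
mr = np.random.rand(6, 5, 4)
x = np.array(list(itertools.combinations(range(6), 3)))
m3d = mr[x].reshape(len(x), -1, mr.shape[2])

def optstd(mr, x, ddof=0):
    # std of each concatenated column from per-slice partial sums
    n = mr.shape[1] * x.shape[1] - ddof
    m2 = np.sum(mr**2, axis=1) / n
    m = np.sum(mr, axis=1) / n
    return np.sqrt(m2[x].sum(1) - m[x].sum(1)**2)

assert np.allclose(optstd(mr, x), np.std(m3d, axis=1))
```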

Now we get some real speedup:

```python
%timeit ((np.mean(m3d, axis=1) - rf/days) / optstd(mr, x)).T * sqrt(days)
# 358 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

That is about 12x faster than `np.array( [nb_sr_wrapper(m) for m in m3d.T] )`.

### Final version

Finally, we note that the above still does repeated operations for the mean of `m3d` along `axis=1`. With the same reasoning as for `std` above, we can break the mean into pieces and reuse components. In fact, we can simply reuse `m` from the optimized `std` calculation, and combine everything into a single function that computes both `mean` and `std` directly from `mr` and `x`, in an optimized fashion. This allows us to get rid of `m3d` entirely:

```python
def optcalc(mr, x, days, rf, ddof=0):
    n = mr.shape[1] * x.shape[1]
    m2 = np.sum(mr**2, axis=1)
    m = np.sum(mr, axis=1)
    A = m[x].sum(1) / n
    B = np.sqrt(m2[x].sum(1) / (n - ddof) - m[x].sum(1)**2 / (n - ddof)**2)
    return ((A - rf/days) / B).T * sqrt(days)

out4 = optcalc(mr, x, days, rf)

>>> np.allclose(out, out4)
True

%timeit optcalc(mr, x, days, rf)
# 78.1 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

The timing is now 54x faster than the original computation of `out`, and we don't need `m3d` at all. So, counting the time for that first step as well, the overall speedup is 88.89x.

### Why is this faster?

1. No Python loops.
2. Multiplications and subtractions are moved as far "outside" as possible, where the dimensions have already been reduced.
3. Instead of computing `std` on all the combinations of columns, we break it into parts, reuse as much as possible the partial sums made on the original `mr` columns, and compute the equivalent `std` of the concatenations ourselves.
4. In the final version, `optcalc` applies the same idea to `mean`, resulting in a calculation that does away with `m3d` altogether.

### Postscriptum

In the comments, a question was asked about how to parallelize this and whether it would be worth it. The short answer is: I don't know. I can see several possibilities, but no obvious winner.

I wanted to show what's possible with plain `numpy` (not even `numba`) by thinking about what the operations actually compute. Often it is easier, and gives better performance, to optimize for a single thread and then let your machine process multiple such problems in parallel.

BTW, one possible (still single-threaded) optimization of the above would be to avoid computing `x` and rewrite the combinations to take advantage of the recursive structure of the overall calculation. For example, with the OP's problem size, there are 495 rows of `x` starting with `[0, 1, 2, 3]`, and those are among the 1287 rows starting with `[0, 1, 2]`; one could exploit that structure for another speedup (possibly requiring `numba`, as some looping would then have to be made explicit).
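Those prefix counts can be checked directly with `math.comb`: rows of `x` sharing a prefix of length `k` choose their remaining `8 - k` elements from the indices above the prefix.

```python
import itertools
from math import comb
import numpy as np

x = np.array(list(itertools.combinations(range(16), 8)))
assert len(x) == comb(16, 8) == 12870

# Rows with prefix [0, 1, 2, 3]: remaining 4 elements come from {4..15}
n4 = np.sum((x[:, :4] == [0, 1, 2, 3]).all(axis=1))
assert n4 == comb(12, 4) == 495

# Rows with prefix [0, 1, 2]: remaining 5 elements come from {3..15}
n3 = np.sum((x[:, :3] == [0, 1, 2]).all(axis=1))
assert n3 == comb(13, 5) == 1287
```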