
performance improvement on numpy ndarray columnar calculation (row reduction)


I'm doing a row reduction on a 3-dimensional ndarray (KxMxN), i.e., taking all values of a column and applying a reduce function to produce a scalar; eventually the KxMxN array becomes a 2-dimensional ndarray of shape KxN.
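
To illustrate with a toy example (using a plain mean as the reduce function; the shapes here are made up, not my real ones):

import numpy as np

a = np.random.rand(3, 4, 5)   # K x M x N
out = a.mean(axis=1)          # reduce each column over its M values
out.shape                     # (3, 5), i.e. K x N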

What I am trying to solve:

  1. Take in a 3-D matrix and split it into 16 sub-matrices (or any even number of splits larger than 2; 16 is the optimal one for me);
  2. Generate all combinations that select half of the sub-matrices (8 out of the 16 pieces) and join those 8 sub-matrices into one matrix, so that there is one matrix per combination;
  3. Calculate a scalar value from reduce function for each column of the new matrix, for all combinations, ideally at once.

There are more implementation details, which I'll explain along the way.

The 3-D ndarray contains floats.

In the following example, njit with numpy is the best I could get so far. I'm wondering whether there's room for further improvement, from any angle.

cupy (GPU parallelization), dask (CPU parallelization), and numba parallelization all failed to beat the code below (my use case is apparently far too small to leverage the power of a GPU, and I only have 8 GB of GPU memory). There's a good chance these tools can be used in a much more advanced way that I don't know about.

from numba import njit, guvectorize, float64, int64
from math import sqrt
import numba as nb
import numpy as np
import itertools


# Create a 2D ndarray
m = np.random.rand(800,100)
# Reshape it into a list of sub-matrices
mr = m.reshape(16,50,100)

# Create an indices matrix from combinatorics 
# a typical one for me "select 8 from 16", 12870 combinations
# I do have a custom combination generator, but this is not what I wanted to optimise and itertools really has done a decent job already.
x = np.array( list(itertools.combinations(np.arange(16),8)) )

# Now we select 8 sub-matrices from `mr` (per combination) and reshape them into one bigger matrix; we do this in a list comprehension.
# This is the matrix we are going to reduce.

# Bottleneck 1: This line takes the longest and I'd hope to improve on this line, but I am not sure there's much we could do here.

m3d = np.array([mr[idx_arr].reshape(400,100) for idx_arr in x])

# We create different versions of the same reduce function. 

# Bottleneck 2: The reduce function is another place I'd want to improve on.

# col - column values
# days - trading days in a year
# rf - risk free rate

# njit version with instance function `mean`, `std`, and python `sqrt`
@njit
def nb_sr(col, days, rf):
    mean = (col.mean() * days) - rf
    std = col.std() * sqrt(days)
    return mean / std

# njit version with numpy
@njit
def nb_sr_np(col, days, rf):
    mean = (np.mean(col) * days) - rf
    std = np.std(col) * np.sqrt(days)
    return mean / std

# guvectorize with numpy
@guvectorize([(float64[:],int64,float64,float64[:])], '(n),(),()->()', nopython=True)
def gu_sr_np(col,days,rf,res):
    mean = (np.mean(col) * days) - rf
    std = np.std(col) * np.sqrt(days)
    res[0] = mean / std

# We wrap them so that they can be applied to a 2-D matrix via list comprehension.

# Bottleneck 3: I was hoping to vectorize this wrapper, but the closest I can get is a list comprehension, which isn't really vectorization.

def nb_sr_wrapper(m2d):
    return [nb_sr(r, 252, .25) for r in m2d.T]

def nb_sr_np_wrapper(m2d):
    return [nb_sr_np(r, 252, .25) for r in m2d.T]

def gu_sr_np_wrapper(m2d):
    return [gu_sr_np(r, 252, .25) for r in m2d.T]
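
# Side note: as far as I can tell, a gufunc with signature '(n),(),()->()'
# broadcasts over any leading dimensions on its own, so the guvectorize
# version may not need a wrapper at all. An untested sketch:
res_gu = gu_sr_np(m3d.transpose(0, 2, 1), 252, .25)  # expected shape: (12870, 100)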

# Finally! here's our performance benchmarking step.

%timeit np.array( [nb_sr_wrapper(m) for m in m3d.T] )
# output: 4.26 s ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.array( [nb_sr_np_wrapper(m) for m in m3d.T] )
# output: 4.33 s ± 26.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.array( [gu_sr_np_wrapper(m) for m in m3d.T] )
# output: 6.06 s ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution

  • TL;DR The final version below doesn't need m3d. Instead it uses mr and x directly, achieving an 88x speedup for steps 1-3 (or a 54x speedup over steps 2 and 3 only). The timing of that final solution is 78 ms.

    Bottleneck 1 is relatively easy to address:

    a, b = x.shape
    _, _, N = mr.shape
    
    new_m3d = mr[x].reshape(a, -1, N)
    
    >>> np.array_equal(m3d, new_m3d)
    True
    

    It's about 3.5x faster on the dimensions provided (16, 50, 100 for mr and 12870, 8 for x).
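
    To see why this works: indexing mr with the integer array x is fancy indexing, and the reshape then merges the 8 selected sub-matrices of each combination:

    >>> mr[x].shape
    (12870, 8, 50, 100)
    >>> new_m3d.shape
    (12870, 400, 100)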

    For bottlenecks #2 and #3, the functionality can be expressed more simply. We start with a slight algebraic refactoring of the function computed by nb_sr():

    $$\text{nb\_sr}(v) \;=\; \frac{\text{mean}(v)\cdot days \,-\, rf}{\text{std}(v)\cdot\sqrt{days}} \;=\; \frac{\text{mean}(v) - rf/days}{\text{std}(v)}\cdot\sqrt{days}$$

    Where v is one "column" of m3d, e.g. m3d[0, :, 0].

    We then use numpy to do the mean and the std on the entire m3d in one shot, and we obtain the simpler (and faster) expression:

    days, rf = 252, 0.25   # the constants hard-coded in the question's wrappers
    # baseline result from the question, kept for verification:
    out = np.array([nb_sr_wrapper(m) for m in m3d.T])

    out2 = ((np.mean(m3d, axis=1) - rf/days) / np.std(m3d, axis=1)).T * sqrt(days)

    >>> np.allclose(out, out2)
    True
    

    In terms of timing, this is about 2x faster than np.array([nb_sr_wrapper(m) for m in m3d.T]), as all the looping is done inside numpy (no Python loops, not even in a list comprehension).

    Breaking the standard deviation into parts

    However, we observe that most of the time in the out2 calculation is spent in np.std(m3d, axis=1). Let us try to speed that up. Thinking about that operation and how m3d was created in the first place (each v being a concatenation of several rows of mr according to the combinations in x), we see that there are lots of repeated calculations. For example, with the OP's size parameters, we have:

    >>> x[0]
    array([0, 1, 2, 3, 4, 5, 6, 7])
    
    >>> x[1]
    array([0, 1, 2, 3, 4, 5, 6, 8])
    

    Each x[i] denotes 8 slices of mr that end up concatenated in m3d: with the OP's sizes, each x[i] has 8 elements, each mr[x[i]] has shape (8, 50, 100) and is reshaped into m3d[i] with shape (400, 100), and np.std(m3d, axis=1) computes the standard deviation of each of the 12870 * 100 vectors v of size 400. The goal is therefore to express std(v) from the constituents of v, i.e. the 8 slices of mr indexed by x[i].
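
    For example, x[0] and x[1] above share 7 of their 8 indices, so almost all of the work behind their two std computations is identical:

    >>> len(set(x[0]) & set(x[1]))
    7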

    The standard deviation (with p degrees of freedom) is:

    $$\text{std}(v) \;=\; \sqrt{\frac{\sum_i v_i^2}{n-p} \;-\; \left(\frac{\sum_i v_i}{n-p}\right)^2}$$

    where n is the number of elements of v.

    And now we can express each of these sums from the partial sums of the constituents, which we compute only once. Our new std calculation doesn't use m3d at all, only mr and x:

    def optstd(mr, x, ddof=0):
        # total number of samples in each concatenated column, minus ddof
        n = mr.shape[1] * x.shape[1] - ddof
        # partial sums per sub-matrix, computed only once
        m2 = np.sum(mr**2, axis=1) / n
        m = np.sum(mr, axis=1) / n
        # recombine the partial sums according to the combinations in x
        return np.sqrt(m2[x].sum(1) - m[x].sum(1)**2)
    
    out3 = ((np.mean(m3d, axis=1) - rf/days) / optstd(mr, x)).T * sqrt(days)
    
    >>> np.allclose(out, out3)
    True
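
    As a direct sanity check of the std itself (with ddof=0 the rewrite is exact, so this should hold up to floating point):

    >>> np.allclose(np.std(m3d, axis=1), optstd(mr, x))
    True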
    

    Now we get some real speedup:

    %timeit ((np.mean(m3d, axis=1) - rf/days) / optstd(mr, x)).T * sqrt(days)
    358 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    That is about 12x faster than np.array( [nb_sr_wrapper(m) for m in m3d.T] ).

    Final version

    Finally, we note that the above still performs repeated work for the mean of m3d along axis=1. With the same reasoning as for the std above, we can break the mean into pieces and reuse components. In fact, we can simply reuse m from the optimized std calculation and combine everything in a single function that computes both mean and std directly from mr and x in an optimized fashion. This lets us get rid of m3d entirely:

    def optcalc(mr, x, days, rf, ddof=0):
        # number of samples in each concatenated column
        n = mr.shape[1] * x.shape[1]
        # partial sums per sub-matrix, computed only once and reused
        m2 = np.sum(mr**2, axis=1)
        m = np.sum(mr, axis=1)
        # mean and std of each concatenation, from the partial sums
        A = m[x].sum(1) / n
        B = np.sqrt(m2[x].sum(1) / (n - ddof) - m[x].sum(1)**2 / (n - ddof)**2)
        return ((A - rf/days) / B).T * sqrt(days)
    
    out4 = optcalc(mr, x, days, rf)
    
    >>> np.allclose(out, out4)
    True
    
    %timeit optcalc(mr, x, days, rf)
    # 78.1 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    The timing is now 54x faster than the original computation of out, and we don't need m3d for it at all. Counting the time of that first step as well, the overall speedup is 88.89x.
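
    Putting it all together, the whole pipeline (steps 1 to 3) now reduces to a few lines. A self-contained recap with the OP's sizes:

    import itertools
    import numpy as np
    from math import sqrt

    days, rf = 252, 0.25
    m = np.random.rand(800, 100)
    mr = m.reshape(16, 50, 100)                                   # step 1
    x = np.array(list(itertools.combinations(np.arange(16), 8)))  # step 2 (indices only)
    out4 = optcalc(mr, x, days, rf)                               # step 3; shape (100, 12870)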

    Why is this faster?

    1. No Python loops.
    2. Move multiplications and subtractions as far "outside" as possible, where the dimensions are already reduced.
    3. Instead of computing the std on all the combinations of columns, we break it into parts so as to reuse the partial sums computed on the original mr columns, and compute the equivalent std of the concatenations ourselves.
    4. In the final proposal, optcalc applies the same idea to the mean, resulting in a calculation that does away with m3d altogether.

    Postscriptum

    In the comments, a question was raised about how to parallelize this and whether it would be worth it. The short answer is: I don't know; I can see several possibilities, but no obvious winner.

    I wanted to show what's possible with just numpy (not even numba), by thinking about what the operations actually compute. Often, it is easier and gives better performance to optimize for a single thread, and then let your machine process multiple such problems in parallel.

    BTW, one possible optimization (still single-threaded) would be to avoid computing x altogether and rewrite the combinations to take advantage of the recursive structure of the overall calculation. For example, with the OP's problem size, there are 495 rows of x starting with [0, 1, 2, 3], and those are among the 1287 rows starting with [0, 1, 2]: one could exploit that structure for another speedup (possibly requiring numba, as some looping would then have to be made explicit).
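
    Those counts follow directly from the combinatorics; a quick check:

    >>> from math import comb
    >>> comb(12, 4)   # rows of x starting with [0, 1, 2, 3]
    495
    >>> comb(13, 5)   # rows of x starting with [0, 1, 2]
    1287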