Tags: python, numpy, linear-regression, numba

Efficient computation of moving linear regression with numpy/numba


I am trying to create a moving linear regression indicator and I want to use numba to speed it up. However, I am struggling with that part, as I lack experience with numba.

Here's what I have so far using numpy. It works, but without numba it is quite slow once you throw large arrays at it.

import numpy as np


def ols_1d(y, window):
    # Each row of y_roll is one window of length `window`.
    y_roll = np.lib.stride_tricks.sliding_window_view(y, window_shape=window)

    # The design matrix is identical for every window: x-values 1..window
    # plus a column of ones for the intercept.
    A = np.vstack([np.arange(1, window + 1), np.ones(window)]).T

    m = list()
    c = list()
    for row in range(y_roll.shape[0]):
        tmp_m, tmp_c = np.linalg.lstsq(A, y_roll[row], rcond=None)[0]
        m.append(tmp_m)
        c.append(tmp_c)

    m, c = np.array([m, c])

    # Evaluate each regression line at the last in-window x (x = window)
    # and pad the first window - 1 positions with NaN.
    return np.hstack((np.full((window - 1), np.nan), m * window + c))


def ols_2d(y, window):
    # Apply the 1-D routine to each column of y.
    out = list()
    for col in range(y.shape[1]):
        out.append(ols_1d(y=y[:, col], window=window))

    return np.array(out).T


if __name__ == "__main__":
    a = np.random.randn(
        10000, 10
    )  # function is slow once you really increase the number of columns (say, 1 mln)

    print(ols_2d(a, 10))

The indicator computes a linear regression with the np.linalg.lstsq function (https://numpy.org/doc/stable/reference/generated/numpy.linalg.lstsq.html) over a given window length. It then outputs the very last point of that regression line, i.e. the fitted value m * window + c, moves on to the next range, and computes the linear regression again. At the end of the day, ols_1d collects the last point of every individual regression line into an array.

Now I need help applying numba on top of it. I am not familiar with numba, but as far as my own trial and error goes, numba seems to have an issue with np.lib.stride_tricks.sliding_window_view().

EDIT: Follow-up question

Using @aerobiomat's suggestion, I only needed to slightly modify np.cumsum to account for matrices rather than vectors.

def window_sum(x, w):
    c = np.cumsum(x, axis=0)  # inserted axis argument
    s = c[w - 1:]
    s[1:] -= c[:-w]
    return s

def window_lin_reg(x, y, w):
    sx = window_sum(x, w)
    sy = window_sum(y, w)
    sx2 = window_sum(x**2, w)
    sxy = window_sum(x * y, w)
    slope = (w * sxy - sx * sy) / (w * sx2 - sx**2)
    intercept = (sy - slope * sx) / w
    return slope, intercept

Now, let's create a 20x3 matrix, where columns represent individual time series.

timeseries_count = 3
x = np.arange(start=1, stop=21).reshape(-1, 1)
y = np.random.randn(20, timeseries_count)

slope, intercept = window_lin_reg(x, y, 10)

This is working fine. However, as soon as I introduce some np.nan, I am running into some issues.

y[0, 0] = np.nan
y[5, 0] = np.nan
y[10, 2] = np.nan

In order to compute the rolling regression, I would need to drop all np.nan values in each column and compute the regression column by column, which goes against vectorization. Every column in the real dataset may contain a different number of np.nan values. How can I cope with that in a clever way? I need speed, as the dataset may be quite large (10000 x 10000 or so). One direction I have been sketching is below.
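The idea (untested beyond this toy case, and the helper name window_lin_reg_nan is my own): assuming NaNs appear only in y, zero them out and carry a per-window count of valid samples using the same cumulative-sum trick, so nothing has to be looped column by column. Reusing window_sum() from above:

def window_lin_reg_nan(x, y, w):
    # NaNs in y are treated as missing pairs: they contribute 0 to every sum,
    # and a per-window count n replaces the constant w in the OLS formulas.
    valid = (~np.isnan(y)).astype(float)  # 1.0 where y is observed, 0.0 at NaN
    y0 = np.where(np.isnan(y), 0.0, y)
    n = window_sum(valid, w)              # valid samples per window and column
    sx = window_sum(x * valid, w)         # x (N x 1) broadcasts against each column's mask
    sy = window_sum(y0, w)
    sx2 = window_sum(x**2 * valid, w)
    sxy = window_sum(x * y0, w)
    slope = (n * sxy - sx * sy) / (n * sx2 - sx**2)
    intercept = (sy - slope * sx) / n
    return slope, intercept

Windows containing fewer than two valid points make the denominator zero, so those entries come out as inf/nan and would need to be masked afterwards.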


Solution

  • Instead of calculating the linear regression for each window, which involves repeating many intermediate calculations, you can compute the values needed by the closed-form formula for all windows and perform a vectorized calculation of all regressions.
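    For a window of length $w$ (all sums running over that window), the standard closed-form least-squares formulas are

    $$\text{slope} = \frac{w\sum x_i y_i - \sum x_i \sum y_i}{w\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad \text{intercept} = \frac{\sum y_i - \text{slope}\cdot\sum x_i}{w}$$

    and window_sum() computes each of the four required sums for all windows in a single pass: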

    def window_sum(x, w):
        # Faster than np.lib.stride_tricks.sliding_window_view(x, w).sum(axis=-1)
        c = np.cumsum(x)
        s = c[w - 1:]
        s[1:] -= c[:-w]
        return s
    
    def window_lin_reg(x, y, w):
        sx = window_sum(x, w)
        sy = window_sum(y, w)
        sx2 = window_sum(x**2, w)
        sxy = window_sum(x * y, w)
        slope = (w * sxy - sx * sy) / (w * sx2 - sx**2)
        intercept = (sy - slope * sx) / w
        return slope, intercept
    

    For example:

    >>> w = 5      # Window
    >>> x = np.arange(15)
    >>> y = 0.1 * x**2 - x + 10
    >>> slope, intercept = window_lin_reg(x, y, w)
    >>> print(slope)
    [-0.6 -0.4 -0.2  0.   0.2  0.4  0.6  0.8  1.   1.2  1.4]
    >>> print(intercept)
    [ 9.8  9.3  8.6  7.7  6.6  5.3  3.8  2.1  0.2 -1.9 -4.2]
    

    Comparing with np.linalg.lstsq() and loops:

    def numpy_lin_reg(x, y, w):
        n = len(x) - w + 1  # number of windows
        slope = np.empty(n)
        intercept = np.empty(n)
        for i in range(n):
            A = np.vstack((x[i:i + w], np.ones(w))).T
            m, c = np.linalg.lstsq(A, y[i:i + w], rcond=None)[0]
            slope[i] = m
            intercept[i] = c
        return slope, intercept
    

    Same example, same results:

    >>> slope2, intercept2 = numpy_lin_reg(x, y, w)
    >>> with np.printoptions(precision=2, suppress=True):
    ...     print(slope2)
    ...     print(intercept2)
    [-0.6 -0.4 -0.2  0.   0.2  0.4  0.6  0.8  1.   1.2  1.4]
    [ 9.8  9.3  8.6  7.7  6.6  5.3  3.8  2.1  0.2 -1.9 -4.2]
    

    Some timings with a larger example:

    >>> w = 20
    >>> x = np.arange(10000)
    >>> y = 0.1 * x**2 - x + 10
    
    >>> %timeit numpy_lin_reg(x, y, w)
    324 ms ± 11.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    >>> %timeit window_lin_reg(x, y, w)
    189 µs ± 3.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    That's three orders of magnitude of performance increase. The functions are already vectorized, so there's little Numba can do: decorated with @nb.njit, as sketched below, they get "only" about twice as fast.
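    A minimal sketch of what the decorated versions could look like (the _nb names are mine; this assumes numba is imported as nb and relies on Numba supporting np.cumsum on 1-D arrays, so the function bodies stay unchanged):

    import numba as nb
    import numpy as np

    @nb.njit
    def window_sum_nb(x, w):
        # Running sum: the difference of two cumsum entries gives any window sum.
        c = np.cumsum(x)
        s = c[w - 1:]
        s[1:] -= c[:-w]
        return s

    @nb.njit
    def window_lin_reg_nb(x, y, w):
        sx = window_sum_nb(x, w)
        sy = window_sum_nb(y, w)
        sx2 = window_sum_nb(x**2, w)
        sxy = window_sum_nb(x * y, w)
        slope = (w * sxy - sx * sy) / (w * sx2 - sx**2)
        intercept = (sy - slope * sx) / w
        return slope, intercept

    Timing the compiled version: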

    >>> %timeit window_lin_reg(x, y, w)
    96.4 µs ± 350 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)