Rewriting MATLAB code with nested loops in Python with fast execution

Here is a nested loop, where the inner one indices depend on the outer one, with the following parameters:

f = rand(1,70299)
nech=24*30*24
N=length(f);
xh=(1:nech)/24;

In MATLAB:

sf2(1:nech)=0.;
sf2vel(1:nech)=0.;
count(1:nech)=0.;

for i=1:nech
    for j=1:N-i-1
        sf2(i)=sf2(i)+(f(j+i)-f(j))^2;
        count(i)=count(i)+1; 
    end
    sf2(i)=sf2(i)/count(i);
end

In Python:

def structFunPython(f,N,nech):
    sf2 = np.zeros(N)
    count = np.zeros(N)
    for i in range(nech):
        indN = np.arange(1,N-i-1)
        for j in indN:
            sf2[i] += np.power((f[i+j]-f[j]),2)
            count[i] += 1
        sf2[i] = sf2[i]/count[i]
    return sf2

With cython:

import cython
cimport numpy as np
import numpy as np
def structFun(np.ndarray f,N,nech):
    cdef np.ndarray sf2 = np.zeros(N), count = np.zeros(N),
    for i in range(nech):
        indN = np.arange(1,N-i-1)
        for j in indN:
            sf2[i] += np.power((f[i+j]-f[j]),2)
            count[i] += 1
        sf2[i] = sf2[i]/count[i]
    return sf2

Executions times:

Matlab: 7.8377 sec
Python: 3651.35 sec
Cython: 3336.21 sec

I have a hard time believing Python and Cython (especially Cython) are that slow for the same computation, so I think I must have made an error in my Python/Cython loops, but I can't see where.

Solution

Disclaimer: as @norok2 pointed out in a comment, my approach allocates at least N*(N-1)/2 doubles due to the use of pdist. For your N = 70299 this means about 18.5 GB of doubles in an array. Other index arrays will have similar size. So unless some of your use cases have smaller N, the vectorised approach in this answer is only feasible if you have a lot of memory.

As others have noted, just translating code from one language to another will not lead to optimal code in both languages. And using Cython alone is no guarantee for speed-up, in the same way that using NumPy alone is no guarantee for speed-up.

norok2's answer nicely shows you how you can use numba or something similar to compile your numerical code. This can give you something quite similar to what you have in MATLAB in terms of performance, since MATLAB has a just-in-time (JIT) compiler of its own. There's also wiggle room to optimize your code for compilation, because multiple implementations can end up with wildly different performance.

Anyway, I want to make the point that you can also speed up your code by using advanced features in NumPy and SciPy. In particular, what you want is to compute pairwise squared distances among a set of 1d points. This is what scipy.spatial.distance.pdist can do for you (with the 'sqeuclidean' squared Euclidean norm). The upside is that it only compute each pairwise distance once (which is great for CPU and memory performance), but the downside is that picking out the contributions that you want to sum up a bit more cumbersome.

Anyway, here is the code corresponding to your Python implementation (with the fix that the inner loop uses np.arange(1, N-i) rather than np.arange(1, N-i-1)):

from scipy.spatial.distance import pdist

def pdisty(f, nech):
    offset = f.size - nech
    res = np.zeros_like(f, shape=nech)
    dists = pdist(f[:, None], metric='sqeuclidean')
    counts = np.arange(offset, f.size)[::-1]
    inds = np.repeat(np.arange(res.size), counts)
    np.add.at(res, inds, dists[:inds.size])
    res /= counts
    return res

What happens here is that

we compute the pairwise distance of each unique pair of array values and store it to dists
we compute the number of pairs each point is involved in (this is what we'll have to normalise at the end), store it to counts
figure out the 1d index of each value in dists that it corresponds to (this is the hard part), store it in inds
use np.add.at to accumulate each contribution to the appropriate output index
normalise with the counts.

Here are some timings for N = 1000, where func2() is the corresponding function from norok2's answer:

>>> %timeit structFunPython(f, f.size - 1)
... %timeit func2(f, f.size - 1)
... %timeit pdisty(f, f.size - 1)
1.48 s ± 89.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
274 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
36.7 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The above solution is considerably faster, but of course it's still slower than a fully compiled solution. If you have an issue with additional dependencies or getting llvm installed on your system, this might be a reasonable compromise. The bottom line is that code should be adapted to the language you're trying to optimise it in.

For completeness' sake here are the implementations I used for the comparison (I slightly changed the signatures because N can be computed from the input array, and I fixed some fencepost errors):

def structFunPython(f, nech):
    """Very slightly modified from the question"""
    N = f.size
    sf2 = np.zeros(nech)
    count = np.zeros(nech)
    for i in range(nech):
        indN = np.arange(1,N-i)
        for j in indN:
            sf2[i] += np.power((f[i+j]-f[i]),2)
            count[i] += 1
        sf2[i] = sf2[i]/count[i]
    return sf2


def func2(f_arr, nech):
    """Very slightly modified from norok2's answer

    See https://stackoverflow.com/a/71704834/5067311

    """
    n = f_arr.size
    sf2 = np.zeros(nech)
    for i in range(nech):
        for j in range(1, n - i):
            sf2[i] += (f_arr[i + j] - f_arr[i]) ** 2
        sf2[i] /= (n - i - 1)
    return sf2

With these definitions all three functions give the same result within machine precision:

rng = np.random.default_rng()
N = 1000
nech = N - 2
f = rng.random(1000)
assert np.allclose(structFunPython(f, nech), func2(f, nech))
assert np.allclose(structFunPython(f, nech), pdisty(f, nech))