Here is a nested loop, where the inner one indices depend on the outer one, with the following parameters:
f = rand(1,70299)
nech=24*30*24
N=length(f);
xh=(1:nech)/24;
In MATLAB:
sf2(1:nech)=0.;
sf2vel(1:nech)=0.;
count(1:nech)=0.;
for i=1:nech
for j=1:N-i-1
sf2(i)=sf2(i)+(f(j+i)-f(j))^2;
count(i)=count(i)+1;
end
sf2(i)=sf2(i)/count(i);
end
In Python:
def structFunPython(f,N,nech):
sf2 = np.zeros(N)
count = np.zeros(N)
for i in range(nech):
indN = np.arange(1,N-i-1)
for j in indN:
sf2[i] += np.power((f[i+j]-f[j]),2)
count[i] += 1
sf2[i] = sf2[i]/count[i]
return sf2
With cython:
import cython
cimport numpy as np
import numpy as np
def structFun(np.ndarray f,N,nech):
cdef np.ndarray sf2 = np.zeros(N), count = np.zeros(N),
for i in range(nech):
indN = np.arange(1,N-i-1)
for j in indN:
sf2[i] += np.power((f[i+j]-f[j]),2)
count[i] += 1
sf2[i] = sf2[i]/count[i]
return sf2
Executions times:
Matlab: 7.8377 sec
Python: 3651.35 sec
Cython: 3336.21 sec
I have a hard time believing Python and Cython (especially Cython) are that slow for the same computation, so I think I must have made an error in my Python/Cython loops, but I can't see where.
Disclaimer: as @norok2 pointed out in a comment, my approach allocates at least N*(N-1)/2
doubles due to the use of pdist
. For your N = 70299
this means about 18.5 GB of doubles in an array. Other index arrays will have similar size. So unless some of your use cases have smaller N
, the vectorised approach in this answer is only feasible if you have a lot of memory.
As others have noted, just translating code from one language to another will not lead to optimal code in both languages. And using Cython alone is no guarantee for speed-up, in the same way that using NumPy alone is no guarantee for speed-up.
norok2's answer nicely shows you how you can use numba
or something similar to compile your numerical code. This can give you something quite similar to what you have in MATLAB in terms of performance, since MATLAB has a just-in-time (JIT) compiler of its own. There's also wiggle room to optimize your code for compilation, because multiple implementations can end up with wildly different performance.
Anyway, I want to make the point that you can also speed up your code by using advanced features in NumPy and SciPy. In particular, what you want is to compute pairwise squared distances among a set of 1d points. This is what scipy.spatial.distance.pdist
can do for you (with the 'sqeuclidean'
squared Euclidean norm). The upside is that it only compute each pairwise distance once (which is great for CPU and memory performance), but the downside is that picking out the contributions that you want to sum up a bit more cumbersome.
Anyway, here is the code corresponding to your Python implementation (with the fix that the inner loop uses np.arange(1, N-i)
rather than np.arange(1, N-i-1)
):
from scipy.spatial.distance import pdist
def pdisty(f, nech):
offset = f.size - nech
res = np.zeros_like(f, shape=nech)
dists = pdist(f[:, None], metric='sqeuclidean')
counts = np.arange(offset, f.size)[::-1]
inds = np.repeat(np.arange(res.size), counts)
np.add.at(res, inds, dists[:inds.size])
res /= counts
return res
What happens here is that
dists
counts
dists
that it corresponds to (this is the hard part), store it in inds
np.add.at
to accumulate each contribution to the appropriate output indexHere are some timings for N = 1000
, where func2()
is the corresponding function from norok2's answer:
>>> %timeit structFunPython(f, f.size - 1)
... %timeit func2(f, f.size - 1)
... %timeit pdisty(f, f.size - 1)
1.48 s ± 89.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
274 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
36.7 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The above solution is considerably faster, but of course it's still slower than a fully compiled solution. If you have an issue with additional dependencies or getting llvm installed on your system, this might be a reasonable compromise. The bottom line is that code should be adapted to the language you're trying to optimise it in.
For completeness' sake here are the implementations I used for the comparison (I slightly changed the signatures because N
can be computed from the input array, and I fixed some fencepost errors):
def structFunPython(f, nech):
"""Very slightly modified from the question"""
N = f.size
sf2 = np.zeros(nech)
count = np.zeros(nech)
for i in range(nech):
indN = np.arange(1,N-i)
for j in indN:
sf2[i] += np.power((f[i+j]-f[i]),2)
count[i] += 1
sf2[i] = sf2[i]/count[i]
return sf2
def func2(f_arr, nech):
"""Very slightly modified from norok2's answer
See https://stackoverflow.com/a/71704834/5067311
"""
n = f_arr.size
sf2 = np.zeros(nech)
for i in range(nech):
for j in range(1, n - i):
sf2[i] += (f_arr[i + j] - f_arr[i]) ** 2
sf2[i] /= (n - i - 1)
return sf2
With these definitions all three functions give the same result within machine precision:
rng = np.random.default_rng()
N = 1000
nech = N - 2
f = rng.random(1000)
assert np.allclose(structFunPython(f, nech), func2(f, nech))
assert np.allclose(structFunPython(f, nech), pdisty(f, nech))