I have parallelised some Cython code with OpenMP. Once in a while, the code computes wrong results.

I created a nearly minimal working example of my problem. "Nearly", because the frequency of the wrong results seems to depend on even the tiniest changes in the code, which is why, e.g., I kept the function pointers in.
The Cython code is
#cython: language_level=3, boundscheck=False, wraparound=False, cdivision=True
# distutils: language = c++

import numpy as np
cimport cython
from cython.parallel import prange, parallel
from libcpp.vector cimport vector
cimport numpy as np


cdef inline double estimator_matheron(const double f_diff) nogil:
    return f_diff * f_diff

ctypedef double (*_estimator_func)(const double) nogil

cdef inline void normalization_matheron(
    vector[double]& variogram,
    vector[long]& counts,
    const int variogram_len
) nogil:
    cdef int i
    for i in range(variogram_len):
        if counts[i] == 0:
            counts[i] = 1
        variogram[i] /= (2. * counts[i])

ctypedef void (*_normalization_func)(vector[double]&, vector[long]&, const int) nogil

def test(const double[:] f):
    cdef _estimator_func estimator_func = estimator_matheron
    cdef _normalization_func normalization_func = normalization_matheron
    cdef int i_max = f.shape[0] - 1
    cdef int j_max = i_max + 1
    cdef vector[double] variogram_local, variogram
    cdef vector[long] counts_local, counts
    cdef int i, j

    with nogil, parallel():
        variogram_local.resize(j_max, 0.0)
        counts_local.resize(j_max, 0)
        for i in range(i_max):
            for j in range(1, j_max-i):
                counts_local[j] += 1
                variogram_local[j] += estimator_func(f[i] - f[i+j])
        normalization_func(variogram_local, counts_local, j_max)
    return np.asarray(variogram_local)
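For reference, the extension is compiled with OpenMP enabled. A build sketch along these lines should reproduce the setup (the file name is an assumption, the module name cython_parallel matches the test script below, and the -fopenmp flags are for gcc-style toolchains; Apple clang needs libomp instead):

# setup.py -- build sketch; flags may need adjusting for Apple clang
# (typically -Xpreprocessor -fopenmp plus linking against -lomp).
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "cython_parallel",
    sources=["cython_parallel.pyx"],
    language="c++",
    include_dirs=[np.get_include()],
    extra_compile_args=["-fopenmp"],
    extra_link_args=["-fopenmp"],
)

setup(ext_modules=cythonize(ext))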
To test the code, I used this script:
import numpy as np
from cython_parallel import test

z = np.array(
    (41.2, 40.2, 39.7, 39.2, 40.1, 38.3, 39.1, 40.0, 41.1, 40.3),
    dtype=np.double,
)
print(test(z))
The result should be:
[0. 0.49166667 0.7625 1.09071429 0.90166667 1.336
0.9525 0.435 0.005 0.405 ]
This is what a wrong result typically looks like:
[0. 0.44319444 0.75483871 1.09053571 0.90166667 1.336
0.9525 0.435 0.005 0.405 ]
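Because the failure is intermittent, a small harness like this one (hypothetical, comparing against the correct output above) can be used to estimate how often wrong results appear:

# repro.py -- hypothetical harness to estimate the failure rate
import numpy as np
from cython_parallel import test

z = np.array(
    (41.2, 40.2, 39.7, 39.2, 40.1, 38.3, 39.1, 40.0, 41.1, 40.3),
    dtype=np.double,
)
# The known-good result from the serial run above.
expected = np.array(
    (0., 0.49166667, 0.7625, 1.09071429, 0.90166667,
     1.336, 0.9525, 0.435, 0.005, 0.405)
)

failures = sum(
    not np.allclose(test(z), expected, atol=1e-6) for _ in range(1000)
)
print(f"{failures} wrong results out of 1000 runs")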
This code mainly sums up numbers into the vector variogram_local. Most of the time the code works, but (without having gathered proper statistics) wrong results show up in maybe every 30th run. It always works if I change the line

with nogil, parallel():

to

with nogil:

It also always works if I don't use the function pointers at all, like this:
with nogil, parallel():
    variogram_local.resize(j_max, 0.0)
    counts_local.resize(j_max, 0)
    for i in range(i_max):
        for j in range(1, j_max-i):
            counts_local[j] += 1
            variogram_local[j] += (f[i] - f[i+j]) * (f[i] - f[i+j])
    for j in range(j_max):
        if counts_local[j] == 0:
            counts_local[j] = 1
        variogram_local[j] /= (2. * counts_local[j])
return np.asarray(variogram_local)
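As background for the with nogil, parallel(): versus with nogil: observation above: a plain nogil block runs once, while everything inside a parallel() block is executed by every OpenMP thread. A minimal, hypothetical sketch that makes this visible (omp_get_thread_num comes from Cython's bundled openmp module):

# threads.pyx -- hypothetical sketch: every thread runs the parallel() body
from cython.parallel import parallel
cimport openmp

def show_threads():
    cdef int tid
    with nogil, parallel():
        # tid is assigned inside the block, so each thread has a private copy
        tid = openmp.omp_get_thread_num()
        with gil:
            print("thread", tid, "executed the block")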
The full code is tested on different platforms, and these problems mainly occur on macOS with clang, e.g.:
https://ci.appveyor.com/project/conda-forge/staged-recipes/builds/29018878
EDIT
Thanks to your input, I modified the code, and with num_threads=2 it works. But as soon as num_threads > 2, I get wrong results again. Do you think that, if Cython's OpenMP support were perfect, my new code should work, or am I still getting something wrong?
If this turns out to be a limitation on Cython's side, I guess I will indeed implement the code in pure C++.
def test(const double[:] f):
    cdef int i_max = f.shape[0] - 1
    cdef int j_max = i_max + 1
    cdef vector[double] variogram_local, variogram
    cdef vector[long] counts_local, counts
    cdef int i, j, k

    variogram.resize(j_max, 0.0)
    counts.resize(j_max, 0)

    with nogil, parallel(num_threads=2):
        variogram_local = vector[double](j_max, 0.0)
        counts_local = vector[long](j_max, 0)
        for i in prange(i_max):
            for j in range(1, j_max-i):
                counts_local[j] += 1
                variogram_local[j] += (f[i] - f[i+j]) * (f[i] - f[i+j])
        for k in range(j_max):
            counts[k] += counts_local[k]
            variogram[k] += variogram_local[k]

    for i in range(j_max):
        if counts[i] == 0:
            counts[i] = 1
        variogram[i] /= (2. * counts[i])
    return np.asarray(variogram)
ANSWER

Contrary to their names, variogram_local and counts_local are not actually local. They are shared, and all threads mess around with them in parallel, hence the undefined result.

Note that you don't actually share any work: all threads just do the same thing, the whole serial task.
A somewhat sensible parallel version would look more like this:
variogram.resize(j_max, 0.0)
counts.resize(j_max, 0)

with nogil, parallel():
    for i in range(i_max):
        for j in prange(1, j_max-i):
            counts[j] += 1
            variogram[j] += estimator_func(f[i] - f[i+j])
The shared arrays are initialized outside, and then the threads share the inner j-loop. Since no two threads will ever work on the same j, this is safe to do.
Now it may not be ideal to parallelize the inner loop. If you want to parallelize the outer loop instead, you have to make truly local variables and merge/reduce them afterwards, for example like the sketch below.
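A hypothetical sketch of that outer-loop variant. It sidesteps the question of how Cython scopes C++ objects in parallel blocks by giving each thread its own row of a shared scratch array, indexed by thread id, followed by a serial merge (omp_get_max_threads/omp_get_thread_num come from Cython's bundled openmp module):

#cython: language_level=3, boundscheck=False, wraparound=False
# outer.pyx -- hypothetical sketch of parallelizing the outer loop
import numpy as np
from cython.parallel import prange, parallel
cimport openmp

def test(const double[:] f):
    cdef int i_max = f.shape[0] - 1
    cdef int j_max = i_max + 1
    cdef int n_threads = openmp.omp_get_max_threads()
    # One scratch row per thread; np.int_ matches a C long.
    cdef double[:, ::1] vario_buf = np.zeros((n_threads, j_max))
    cdef long[:, ::1] count_buf = np.zeros((n_threads, j_max), dtype=np.int_)
    cdef double[::1] variogram = np.zeros(j_max)
    cdef long[::1] counts = np.zeros(j_max, dtype=np.int_)
    cdef int i, j, k, tid

    with nogil, parallel():
        # tid is assigned inside the block, hence private to each thread.
        tid = openmp.omp_get_thread_num()
        for i in prange(i_max):
            # Each thread only ever touches its own row, so no two
            # threads write to the same memory location.
            for j in range(1, j_max - i):
                count_buf[tid, j] += 1
                vario_buf[tid, j] += (f[i] - f[i+j]) * (f[i] - f[i+j])

    # Serial merge of the per-thread rows, then normalization.
    for k in range(n_threads):
        for j in range(j_max):
            counts[j] += count_buf[k, j]
            variogram[j] += vario_buf[k, j]
    for j in range(j_max):
        if counts[j] == 0:
            counts[j] = 1
        variogram[j] /= 2. * counts[j]
    return np.asarray(variogram)

This keeps every write either thread-exclusive (each thread owns its row) or serial (the merge), at the cost of n_threads * j_max extra memory; whether the triangular outer loop then balances well across threads is a separate scheduling question.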