Tags: python, parallel-processing, openmp, cython

Can this problem be implemented in parallel in Cython with OpenMP?


I have parallelised some Cython code with OpenMP. Once in a while, the code computes wrong results.

I created a nearly minimal working example of my problem. "Nearly", because the frequency of the wrong results seems to depend on even the tiniest changes in the code, which is why I kept, e.g., the function pointers in.

The Cython code is

#cython: language_level=3, boundscheck=False, wraparound=False, cdivision=True
# distutils: language = c++

import numpy as np

cimport cython
from cython.parallel import prange, parallel
from libcpp.vector cimport vector
cimport numpy as np

cdef inline double estimator_matheron(const double f_diff) nogil:
    return f_diff * f_diff

ctypedef double (*_estimator_func)(const double) nogil

cdef inline void normalization_matheron(
    vector[double]& variogram,
    vector[long]& counts,
    const int variogram_len
):
    cdef int i
    for i in range(variogram_len):
        if counts[i] == 0:
            counts[i] = 1
        variogram[i] /= (2. * counts[i])

ctypedef void (*_normalization_func)(vector[double]&, vector[long]&, const int)


def test(const double[:] f):
    cdef _estimator_func estimator_func = estimator_matheron
    cdef _normalization_func normalization_func = normalization_matheron

    cdef int i_max = f.shape[0] - 1
    cdef int j_max = i_max + 1

    cdef vector[double] variogram_local, variogram
    cdef vector[long] counts_local, counts

    cdef int i, j

    with nogil, parallel():
        variogram_local.resize(j_max, 0.0)
        counts_local.resize(j_max, 0)

        for i in range(i_max):
            for j in range(1, j_max-i):
                counts_local[j] += 1
                variogram_local[j] += estimator_func(f[i] - f[i+j])

    normalization_func(variogram_local, counts_local, j_max)

    return np.asarray(variogram_local)

To test the code, I used this script:

import numpy as np
from cython_parallel import test

z = np.array(
    (41.2, 40.2, 39.7, 39.2, 40.1, 38.3, 39.1, 40.0, 41.1, 40.3),
    dtype=np.double,
)

print(test(z))

The result should be

[0.         0.49166667 0.7625     1.09071429 0.90166667 1.336
 0.9525     0.435      0.005      0.405     ]

This is what a wrong result typically looks like

[0.         0.44319444 0.75483871 1.09053571 0.90166667 1.336
 0.9525     0.435      0.005      0.405     ]

This code mainly sums up numbers into the vector variogram_local. Most of the time this code works, but, without having gathered proper statistics, it produces wrong results maybe every 30th run. It always works if I change the line with nogil, parallel(): to with nogil:. It also always works if I don't use the function pointers at all, like this:

    with nogil, parallel():
        variogram_local.resize(j_max, 0.0)
        counts_local.resize(j_max, 0)

        for i in range(i_max):
            for j in range(1, j_max-i):
                counts_local[j] += 1
                variogram_local[j] += (f[i] - f[i+j]) * (f[i] - f[i+j])

    for j in range(j_max):
        if counts_local[j] == 0:
            counts_local[j] = 1
        variogram_local[j] /= (2. * counts_local[j])

    return np.asarray(variogram_local)
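
To put a number on "maybe every 30th run", a small driver script can rerun the test and count mismatches against the expected output (a sketch; it reuses the test data and expected values from above, and the tolerance is an arbitrary choice):

import numpy as np
from cython_parallel import test

z = np.array(
    (41.2, 40.2, 39.7, 39.2, 40.1, 38.3, 39.1, 40.0, 41.1, 40.3),
    dtype=np.double,
)
# expected output of a serial run, as printed above
expected = np.array(
    (0.0, 0.49166667, 0.7625, 1.09071429, 0.90166667,
     1.336, 0.9525, 0.435, 0.005, 0.405)
)

# count how many of 1000 runs deviate from the expected result
failures = sum(
    not np.allclose(test(z), expected, atol=1e-6) for _ in range(1000)
)
print(failures, "wrong results in 1000 runs")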

The full code is tested on different platforms, and these problems occur mainly on macOS with clang, e.g.:

https://ci.appveyor.com/project/conda-forge/staged-recipes/builds/29018878

EDIT

Thanks to your input, I modified the code, and with num_threads=2 it works. But as soon as num_threads > 2, I get wrong results again. Do you think that, if Cython's OpenMP support were perfect, my new code should work, or am I still getting something wrong? If this is on Cython's side, I guess I will indeed implement the code in pure C++.

def test(const double[:] f):
    cdef int i_max = f.shape[0] - 1
    cdef int j_max = i_max + 1

    cdef vector[double] variogram_local, variogram
    cdef vector[long] counts_local, counts

    cdef int i, j, k

    variogram.resize(j_max, 0.0)
    counts.resize(j_max, 0)

    with nogil, parallel(num_threads=2):
        variogram_local = vector[double](j_max, 0.0)
        counts_local = vector[long](j_max, 0)

        for i in prange(i_max):
            for j in range(1, j_max-i):
                counts_local[j] += 1
                variogram_local[j] += (f[i] - f[i+j]) * (f[i] - f[i+j])

        for k in range(j_max):
            counts[k] += counts_local[k]
            variogram[k] += variogram_local[k]

    for i in range(j_max):
        if counts[i] == 0:
            counts[i] = 1
        variogram[i] /= (2. * counts[i])

    return np.asarray(variogram)

Solution

  • Contrary to their names, variogram_local and counts_local are not actually local. They are shared, and all threads mess around with them in parallel, hence the undefined result.

    Note that you don't actually share any work. It's just all threads doing the same thing - the whole serial task.

    A somewhat sensible parallel version would look more like this:

    variogram.resize(j_max, 0.0)
    counts.resize(j_max, 0)
    
    with nogil, parallel():
        for i in range(i_max):
            for j in prange(1, j_max-i):
                counts[j] += 1
                variogram[j] += estimator_func(f[i] - f[i+j])
    

    The shared arrays are initialized outside and then the threads share the inner j-loop. Since no two threads will ever work on the same j, this is safe to do.
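
    As a complete function, this inner-loop version could look like the following sketch (same setup as the question's module, compiled as C++; the function pointers are dropped for brevity):

    # distutils: language = c++
    import numpy as np
    from cython.parallel import prange, parallel
    from libcpp.vector cimport vector

    def test(const double[:] f):
        cdef int i_max = f.shape[0] - 1
        cdef int j_max = i_max + 1

        cdef vector[double] variogram
        cdef vector[long] counts
        cdef int i, j

        # shared arrays, initialized once before any thread starts
        variogram.resize(j_max, 0.0)
        counts.resize(j_max, 0)

        with nogil, parallel():
            for i in range(i_max):
                # the j-range is split across threads; no two threads
                # ever touch the same j, so the updates don't race
                for j in prange(1, j_max - i):
                    counts[j] += 1
                    variogram[j] += (f[i] - f[i+j]) * (f[i] - f[i+j])

        for j in range(j_max):
            if counts[j] == 0:
                counts[j] = 1
            variogram[j] /= (2. * counts[j])

        return np.asarray(variogram)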

    Now it may not be ideal to parallelize the inner loop. If you were to actually parallelize the outer loop, you would in fact have to make truly local copies and merge/reduce them afterwards, as sketched below.
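
    For example, an outer-loop version could look like the following sketch (untested on the platforms from the question; it relies on Cython's documented behavior that variables assigned inside a parallel block become thread-private, and it uses an OpenMP lock from Cython's openmp module to serialize the merge step, which the edited code in the question performs unsynchronized):

    # distutils: language = c++
    import numpy as np
    from cython.parallel import prange, parallel
    from libcpp.vector cimport vector
    cimport openmp

    def test(const double[:] f):
        cdef int i_max = f.shape[0] - 1
        cdef int j_max = i_max + 1

        cdef vector[double] variogram, variogram_local
        cdef vector[long] counts, counts_local
        cdef int i, j, k
        cdef openmp.omp_lock_t lock

        variogram.resize(j_max, 0.0)
        counts.resize(j_max, 0)
        openmp.omp_init_lock(&lock)

        with nogil, parallel():
            # assigned inside the parallel block, hence one copy per thread
            variogram_local = vector[double](j_max, 0.0)
            counts_local = vector[long](j_max, 0)

            # the outer i-loop is split across threads; each thread
            # accumulates into its own local buffers
            for i in prange(i_max):
                for j in range(1, j_max - i):
                    counts_local[j] += 1
                    variogram_local[j] += (f[i] - f[i+j]) * (f[i] - f[i+j])

            # merge/reduce: only one thread may update the shared
            # arrays at a time
            openmp.omp_set_lock(&lock)
            for k in range(j_max):
                counts[k] += counts_local[k]
                variogram[k] += variogram_local[k]
            openmp.omp_unset_lock(&lock)

        openmp.omp_destroy_lock(&lock)

        for k in range(j_max):
            if counts[k] == 0:
                counts[k] = 1
            variogram[k] /= (2. * counts[k])

        return np.asarray(variogram)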