c multithreading parallel-processing multiprocessing openmp

How to merge private arrays into a shared one in OMP - C?

I am trying to parallelize a program for finding local maxima using reduction. However, I am encountering a problem during the merging process: the merged array ends up containing exactly two fewer elements than it should. Here's my code

    double start, stop;

    const float tab[40] = {smth...};
    float maxV3[N];
    int nb_max_locaux = 0;

#pragma omp parallel
    {
        float maxTemp[N];
        int k = 0;
        int j = 0;

#pragma omp for reduction(+:nb_max_locaux)
        for (int i = 1; i < N - 1; i++) {
            if ((tab[i - 1] < tab[i]) && (tab[i + 1] < tab[i])) {
                maxTemp[nb_max_locaux] = tab[i];
                nb_max_locaux++;
                k++;
            }
        }

#pragma omp for
        for (int i = 0; i < nb_max_locaux; i++) {
            if (j < k) {
                maxV3[i] = maxTemp[j++];
            }
        }
    }

I have tried debugging the issue and discovered that some threads, which hold several elements, complete their iterations without fully emptying their local arrays.

P.S. When running the same code with a small number of threads (e.g., 3), the final result is correct.

Solution

The reason for your observed behavior is the work distribution in your final for loop. Although you do not specify a schedule, most implementations will use a static schedule distributing the iterations equally to the threads. If not all threads detected the same number of local maxima, some thread will have to few iterations to write out all its values.

Using an atomic index, the code can look like:

    const float tab[40] = {smth...};
    float maxV3[N];
    int nb_max_locaux = 0;
#pragma omp parallel for
    for (int i = 1; i < N - 1; i++) {
        if ((tab[i - 1] < tab[i]) && (tab[i + 1] < tab[i])) {
#pragma omp atomic capture
            int index = nb_max_locaux++;
            maxTemp[index] = tab[i];
        }
    }

This will lead to contention for atomically incrementing nb_max_locaux and also writing to maxTemp[]. The latter is caused by random threads writing to the same cache line.

This solution will also not preserve the ordering of the values.