
OpenMP: splitting loop based on NUMA


I am running the following loop using, say, 8 OpenMP threads:

float* data;
int n;

#pragma omp parallel for schedule(dynamic, 1) default(none) shared(data, n)
for ( int i = 0; i < n; ++i )
{
    DO SOMETHING WITH data[i]
}

Due to NUMA, I'd like to run first half of the loop (i = 0, ..., n/2-1) with threads 0,1,2,3 and second half (i = n/2, ..., n-1) with threads 4,5,6,7.

Essentially, I want to run two loops in parallel, each loop using a separate group of OpenMP threads.

How do I achieve this with OpenMP?

Thank you

PS: Ideally, if the threads from one group finish their half of the loop while the other half is still in progress, I'd like them to join the unfinished group and help process the remaining half.

I am thinking of something like the code below, but I wonder whether I can do this with OpenMP and no extra book-keeping:

int n;
int i0 = 0;
int i1 = n / 2;

#pragma omp parallel for schedule(dynamic, 1) default(none) shared(data,n,i0,i1)
for ( int i = 0; i < n; ++i )
{
    int nt = omp_get_thread_num();
    int j;
    #pragma omp critical
    {
        if ( nt < 4 ) {
            if ( i0 < n / 2 ) j = i0++; // First 4 threads process first half
            else              j = i1++; // of loop unless first half is finished
        }
        else {
            if ( i1 < n ) j = i1++;  // Second 4 threads process second half
            else          j = i0++;  // of loop unless second half is finished 
        }
    }

    DO SOMETHING WITH data[j]
}

Solution

  • Probably the best approach is nested parallelization: parallelize first over NUMA nodes, then within each node. That way you can still use dynamic scheduling inside each group while breaking the data up amongst the thread groups:

    #include <omp.h>
    #include <stdio.h>
    
    int main(int argc, char **argv) {
    
        const int ngroups=2;     /* one group per NUMA node */
        const int npergroup=4;   /* threads per group */
        const int ndata = 16;
    
        omp_set_nested(1);       /* deprecated since OpenMP 5.0; there,
                                    use omp_set_max_active_levels(2) */
        #pragma omp parallel for num_threads(ngroups)
        for (int i=0; i<ngroups; i++) {
            /* split [0, ndata) into roughly equal contiguous chunks */
            int start = (ndata*i+(ngroups-1))/ngroups;
            int end  = (ndata*(i+1)+(ngroups-1))/ngroups;    
    
            #pragma omp parallel for num_threads(npergroup) shared(i, start, end) schedule(dynamic,1)
            for (int j=start; j<end; j++) {
                printf("Thread %d from group %d working on data %d\n", omp_get_thread_num(), i, j);
            }
        }
    
        return 0;
    }
    

    Running this gives

    $ gcc -fopenmp -o nested nested.c -Wall -O -std=c99
    $ ./nested | sort -n -k 9
    Thread 0 from group 0 working on data 0
    Thread 3 from group 0 working on data 1
    Thread 1 from group 0 working on data 2
    Thread 2 from group 0 working on data 3
    Thread 1 from group 0 working on data 4
    Thread 3 from group 0 working on data 5
    Thread 3 from group 0 working on data 6
    Thread 0 from group 0 working on data 7
    Thread 0 from group 1 working on data 8
    Thread 3 from group 1 working on data 9
    Thread 2 from group 1 working on data 10
    Thread 1 from group 1 working on data 11
    Thread 0 from group 1 working on data 12
    Thread 0 from group 1 working on data 13
    Thread 2 from group 1 working on data 14
    Thread 0 from group 1 working on data 15
    

    But note that the nested approach may well change the thread assignments relative to one-level threading, so you will probably have to play with KMP_AFFINITY or other binding mechanisms a bit more to get the pinning right again.
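    For instance, the standard affinity controls introduced in OpenMP 4.0 can express such a binding from the environment. This is only a sketch (the right place list depends on the machine), reusing the `nested` binary built above:

    ```shell
    # Sketch of OpenMP 4.0+ affinity settings for the nested example.
    # "spread" distributes the two outer (per-group) threads across the
    # place list; "close" then packs each inner team next to its master.
    export OMP_NESTED=true
    export OMP_PLACES=cores
    export OMP_PROC_BIND=spread,close
    ./nested
    ```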