c, windows, openmp, affinity, numa

How to use all NUMA nodes with OpenMP on Windows 10


I have access to a dual-socket system consisting of two NUMA nodes to do some data processing.

My code is relatively straightforward, and I'm using OpenMP for the main parallelizable loop, which looks like this (k is a function parameter and buffer is a multi-gigabyte array of length n):

uint64_t m=0;
uint64_t *rk = (uint64_t *) calloc(k, sizeof(uint64_t));
#pragma omp parallel
{
    #pragma omp for reduction(+:m), reduction(+:rk[:k])
    for (uint64_t i=0; i<n-k; i++)
    {
        m += (uint64_t)buffer[i];
        for (uint64_t j=0; j<k; j++)
        {
            rk[j] += (uint64_t)buffer[i]*(uint64_t)buffer[i+j];
        }
    }
    /* Other stuff, serial and parallel */
}

Under Linux Mint I can compile with gcc without problems, and all of the cores on both sockets are put to good use. However, on Windows (mingw-gcc on cygwin) only a single NUMA node is used. Since my code isn't really sensitive to memory latency, the run time is set by the number of cores in use, so I get a 2x slowdown on Windows.

I can't figure out how to force Windows to spread the threads over both nodes. As far as I understand, OpenMP doesn't support affinity on Windows (in the cygwin mingw-gcc implementation, anyway), and I don't know how to do it manually.

Any help is greatly appreciated!


Solution

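  • I found the cause of the issue. There are more than 64 logical cores on the machine, and a single Windows processor group (CPU group) can hold at most 64 logical processors, so Windows needs two groups to address them. By default, it places each NUMA node in its own group, and on Windows 10 a process starts out confined to a single group, which is why only one node was being used.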

    The fix is either to disable HTT (Hyper-Threading), if you have no more than 64 physical cores, or to disable NUMA grouping in the BIOS. In the latter case, the first 64 logical cores are grouped and appear as a single NUMA node in Windows, and the remainder is placed in the second node. The ideal solution depends on your specific application: whether you benefit more from using all the cores or from hyper-threading.

    [EDIT] You can also manage threads manually. If you want to do that, I suggest digging into processtopologyapi.h and processthreadsapi.h, in particular the functions GetActiveProcessorCount and SetThreadGroupAffinity.
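
For the manual route, here is a minimal sketch of how each OpenMP thread might move itself into a processor group. It is not a tested fix for this exact machine. The helper names (group_for_thread, group_mask, pin_current_thread_to_group) are made up for illustration, and it assumes a 64-bit build, Windows 7 or later, and that the active processors of a group occupy the low bits of the affinity mask. Threads are spread round-robin over the groups and may run on any core inside their group. On non-Windows builds the pinning function is a no-op, so the same source still compiles under Linux.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#ifdef _WIN32
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0601 /* processor-group APIs need Windows 7+ */
#endif
#include <windows.h>
#endif

/* Hypothetical helper: round-robin assignment of a thread id to a group. */
static unsigned group_for_thread(unsigned tid, unsigned ngroups)
{
    return ngroups == 0 ? 0u : tid % ngroups;
}

/* Hypothetical helper: mask with the lowest `count` bits set.
 * Assumes the active processors of a group are numbered contiguously from 0. */
static uint64_t group_mask(unsigned count)
{
    assert(count <= 64); /* a processor group never holds more than 64 */
    return count >= 64 ? UINT64_MAX : (((uint64_t)1 << count) - 1);
}

/* Moves the calling thread into the group chosen for `tid`.
 * Returns 0 on success, -1 on failure. */
static int pin_current_thread_to_group(unsigned tid)
{
#ifdef _WIN32
    WORD ngroups = GetActiveProcessorGroupCount();
    WORD group = (WORD)group_for_thread(tid, ngroups);
    GROUP_AFFINITY ga;

    memset(&ga, 0, sizeof ga); /* the Reserved fields must be zero */
    ga.Group = group;
    ga.Mask = (KAFFINITY)group_mask(GetActiveProcessorCount(group));
    return SetThreadGroupAffinity(GetCurrentThread(), &ga, NULL) ? 0 : -1;
#else
    (void)tid; /* nothing to do outside Windows */
    return 0;
#endif
}

/* Usage sketch (needs -fopenmp and #include <omp.h>):
 *
 *   #pragma omp parallel
 *   {
 *       pin_current_thread_to_group((unsigned)omp_get_thread_num());
 *       #pragma omp for reduction(+:m) reduction(+:rk[:k])
 *       for (...) { ... }
 *   }
 */
```

The OpenMP runtime may only count the processors of the group the process started in, so you will probably also have to request the full thread count yourself, for example with omp_set_num_threads(GetActiveProcessorCount(ALL_PROCESSOR_GROUPS)) or through OMP_NUM_THREADS.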