I am processing an array in parallel with OpenMP (the "Work" part below). If I initialize the array in parallel beforehand, the work part takes 18 ms. If I initialize the array serially, without OpenMP, the work part takes 58 ms. What causes the worse performance?
The system:
Example code:
#include <stdlib.h>
#include <omp.h>

unsigned long array_length = 160000000UL;
unsigned long sum = 0;
long* array = (long*)malloc(sizeof(long) * array_length);

// Initialisation
#pragma omp parallel for num_threads(56) schedule(static)
for (unsigned long i = 0; i < array_length; i++) {
    array[i] = i % 10;
}

// Time start

// Work
#pragma omp parallel for num_threads(56) shared(array, array_length) reduction(+: sum)
for (unsigned long i = 0; i < array_length; i++) {
    if (array[i] < 4) {
        sum += array[i];
    }
}
// Time End
There are two aspects at work here:
In a NUMA system, memory pages can be local to a CPU or remote. By default, Linux allocates memory according to a first-touch policy: the first write access to a memory page determines on which NUMA node the page is physically allocated.
If your malloc is large enough that fresh memory is requested from the OS (instead of reusing existing heap memory), this first touch happens during the initialization. Because you use static scheduling, the thread that initialized a given chunk of the array is also the one that works on it later. Therefore, unless a thread gets migrated to a different CPU, which is unlikely, its part of the memory will be local.
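As a sketch of that pattern (the schedule(static) on the work loop and the proc_bind(close) clause are my additions, not taken from the question; setting OMP_PROC_BIND=close in the environment has a similar effect):

// Sketch: identical static chunking in both loops, and thread binding so the
// thread that first touches a chunk cannot migrate before it reads it again.
#pragma omp parallel for num_threads(56) schedule(static) proc_bind(close)
for (unsigned long i = 0; i < array_length; i++)
    array[i] = i % 10;          // first touch: page lands on this thread's NUMA node

#pragma omp parallel for num_threads(56) schedule(static) proc_bind(close) reduction(+: sum)
for (unsigned long i = 0; i < array_length; i++)
    if (array[i] < 4)
        sum += array[i];        // same static chunk, same thread, local pages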
If you don't parallelize the initialization, all of the memory ends up local to the main thread's node, which is worse for the threads running on other sockets.
Note that Windows doesn't use a first-touch policy (AFAIK). So this behavior is not portable.
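On Linux you can verify where the pages actually landed with the move_pages(2) system call, which only queries placement when its nodes argument is NULL. A minimal sketch, assuming the libnuma development headers are installed (link with -lnuma); the helper name node_of is mine:

#include <numaif.h>   // move_pages(); from the libnuma development package

// Returns the NUMA node that currently holds the page containing addr,
// or a negative value if the page has not been touched yet or on error.
// With nodes == NULL, move_pages() only reports placement, it moves nothing.
static int node_of(void *addr)
{
    void *pages[1]  = { addr };
    int   status[1] = { -1 };
    if (move_pages(0 /* 0 = calling process */, 1, pages, NULL, status, 0) != 0)
        return -1;
    return status[0];
}

Calling node_of(&array[0]) and node_of(&array[array_length - 1]) after the initialization loop should report different nodes when the parallel first touch distributed the pages, and the same node when the array was initialized serially.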
The same reasoning also applies to caches: the initialization puts array elements into the caches of the CPU performing it, so if the same CPU accesses that memory during the second phase, the data is cache-hot and ready to use.
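For a sense of scale (my arithmetic, assuming an 8-byte long, not something stated in the question):

    160,000,000 elements * 8 bytes ≈ 1.28 GB in total
    1.28 GB / 56 threads           ≈ 23 MB per static chunk

so only the most recently touched part of each thread's chunk can still be resident when the work loop starts; how much survives depends mainly on the size of the shared last-level cache, while the NUMA placement affects every access to the array.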