I am trying to solve a homework example in parallel computing. Here is the given code snippet:
for (int i = 0; i < n-1; i++) {
    x[i] = (y[i] + x[i+1]) / 7;
}
We have to analyze the snippet for dependencies, parallelize it, and benchmark it with a proper sample size. My first attempt was simply to copy x and remove the anti-dependence like this:
double *x_old = malloc(sizeof(*x_old) * n); // the plain "copy" of x
memcpy(x_old, &x[1], (n - 1) * sizeof(*x)); // to avoid the i + 1 access in the loop we start at x[1] and copy n - 1 elements
#pragma omp parallel for
for (int i = 0; i < n-1; i++) {
    x[i] = (y[i] + x_old[i]) / 7;
}
This code works; I checked the serial and parallel checksums of the array and they are identical. The speedup (excluding the memcpy()) is between 1.6 and 2 for 100_000_000 to 400_000_000 array elements. But if I increase the array size further (at around 470_000_000 elements) the speedup drops significantly, and at 500_000_000 elements it is only 0.375. The number of threads I use makes no difference.
I benchmarked it on three different devices with the same result (speedup-wise):
Lenovo Ideapad 5
Desktop PC
our cluster node
Why is there a drop in speedup, and why does it drop so fast? Why does it happen at that specific array size?
Here is a minimal example of the program:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include "omp.h"

void init(double *arr, int size);
double sum(double *arr, int size);

int main(int argc, char **argv){
    if(argc != 3){
        printf("Wrong use of program! (1) Size of array, (2) num threads\n");
        return EXIT_FAILURE;
    }
    int n = atoi(argv[1]);
    int threads = atoi(argv[2]);
    omp_set_num_threads(threads);

    double *x = malloc(sizeof(*x) * n);
    double *y = malloc(sizeof(*y) * n);
    init(x, n);
    init(y, n);

    double start = omp_get_wtime();
    for (int i = 0; i < n-1; i++) {
        x[i] = (y[i] + x[i+1]) / 7;
    }
    double end = omp_get_wtime();
    double elapsed_s = end - start;
    printf("single elapsed: %fs\n", elapsed_s);
    double checksum_s = sum(x, n);
    printf("single checksum: %f\n", checksum_s);

    init(x, n); // reinit
    init(y, n); // reinit
    double *x_old = malloc(sizeof(*x_old) * n); // the plain "copy" of x
    memcpy(x_old, &x[1], (n - 1) * sizeof(*x)); // to avoid the i + 1 access in the loop we start at x[1] and copy n - 1 elements

    start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n-1; i++) {
        x[i] = (y[i] + x_old[i]) / 7;
    }
    end = omp_get_wtime();
    double elapsed_p = end - start;
    printf("omp threads %d Thread(s) elapsed: %fs\n", omp_get_max_threads(), elapsed_p);
    double checksum_p = sum(x, n);
    printf("omp threads %d Thread(s) single checksum: %f\n", omp_get_max_threads(), checksum_p);

    printf("%s\n", fabs((checksum_p - checksum_s)) < __DBL_EPSILON__ ? "OK" : "FAILED");
    printf("Speedup: %.3f\n", elapsed_s / elapsed_p);
}

void init(double *arr, int size){
    for(int i = 0; i < size; ++i){
        arr[i] = 1.0;
    }
}

double sum(double *arr, int size){
    double sum = 0.0;
    for(int i = 0; i < size; ++i){
        sum += arr[i];
    }
    return sum;
}
Here is a sample benchmark:
The problem certainly comes from swap memory. The code allocates three arrays of 500_000_000 items, i.e. about 4 GB per array (500_000_000 × 8 bytes) and roughly 12 GB overall. Two of the benchmarking machines have only 16 GiB of RAM, and the OS plus the running applications generally already occupy 1-4 GiB of it. When the amount of available physical memory becomes small, the OS starts to move memory pages to swap space, which is typically backed by your storage device and is far slower than RAM.

Note that some operating systems compress data when the amount of available memory becomes small (e.g. zram on Linux, memory compression on Windows). Mainstream OSes (Windows, Linux and macOS) also need some additional space for buffers (e.g. storage device buffers, networking, etc.). The allocation of memory pages can also become slower when the amount of free memory gets small. Also note that on mainstream systems pages are physically allocated in RAM only on the first touch. For more information about this, please read this article.
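If you want to verify this on your machines, one option (a small sketch of mine, not part of the original benchmark) is to query getrusage() after a run on Linux: a peak resident set size well below what the arrays need, together with a large number of major page faults, suggests the run was paging to disk.

#include <stdio.h>
#include <sys/resource.h>

/* Print peak resident set size and major page faults for this process.
   On Linux, ru_maxrss is reported in KiB; major faults required disk I/O,
   so a large count here indicates swapping. */
static void print_memory_stats(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        printf("peak RSS: %.2f GiB\n", ru.ru_maxrss / (1024.0 * 1024.0));
        printf("major page faults: %ld\n", ru.ru_majflt);
    }
}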
Allocating a huge x_old array is not efficient, in both time and space. The memcpy takes a lot of time because it is limited by the throughput of the RAM, which is comparatively low. You can speed the operation up by working on chunks: the idea is to copy a small block (of, say, 16 KiB) in each thread so the copy can be done in the L1 cache of each core. The cost of the memcpy then nearly vanishes, since CPU caches are much faster than RAM in both latency and throughput. In fact, the L1 cache of a notebook/desktop CPU should sustain >100 GiB/s per core, resulting in a cumulated throughput of >600 GiB/s, while the throughput of the RAM should not exceed 40 GiB/s in practice (15x slower).
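Here is a minimal sketch of that idea (my own illustration, not code from the question; the 16 KiB block size and the static range splitting are assumptions). Each thread handles a contiguous range of iterations and copies the block it is about to overwrite into a small per-thread buffer first; the single element at each thread boundary is saved into bound before any thread starts writing.

#include <stdlib.h>
#include <string.h>
#include <omp.h>

#define BLOCK 2048   /* 2048 doubles = 16 KiB, small enough to fit in L1 */

void solve_chunked(double *restrict x, const double *restrict y, long long n)
{
    long long m = n - 1;      /* number of loop iterations              */
    int nt;                   /* number of threads (set below)          */
    double *bound;            /* one saved boundary element per thread  */

    #pragma omp parallel shared(nt, bound)
    {
        #pragma omp single
        {
            nt = omp_get_num_threads();
            bound = malloc(nt * sizeof(*bound));
        }   /* implicit barrier: nt and bound are now visible to all threads */

        int t        = omp_get_thread_num();
        long long lo = m * t / nt;        /* this thread's range [lo, hi) */
        long long hi = m * (t + 1) / nt;

        /* The last iteration of this range reads x[hi], which the next
           thread overwrites first, so save it before anyone writes. */
        bound[t] = x[hi];
        #pragma omp barrier

        double buf[BLOCK];                /* per-thread scratch block */
        for (long long s = lo; s < hi; s += BLOCK) {
            long long len = (s + BLOCK < hi) ? BLOCK : hi - s;
            /* Copy the values this block reads before overwriting them;
               x[s+1 .. s+len-1] still holds the original values here.  */
            memcpy(buf, &x[s + 1], (len - 1) * sizeof(double));
            buf[len - 1] = (s + len == hi) ? bound[t] : x[s + len];
            for (long long i = 0; i < len; ++i)
                x[s + i] = (y[s + i] + buf[i]) / 7;
        }
    }
    free(bound);
}

With this approach the extra 4 GB x_old array disappears entirely, so besides the faster copy the memory footprint drops back to two arrays, which also helps with the swapping issue described above.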
For the cluster node it is a bit more complex, since it is subject to NUMA effects. NUMA architectures are pretty involved; for more information, I advise you to read this article first, then Explanation for why effective DRAM bandwidth reduces upon adding CPUs (including the related links) and Problem of sorting OpenMP threads into NUMA nodes by experiment. Put shortly, you need to care about where pages are allocated (the allocation policy). On most platforms, the key is to control the first touch on the pages.
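As a small illustration of first-touch placement (a sketch of mine, not taken from the linked articles): if the arrays are initialized in parallel with the same static schedule that the compute loop later uses, each page is first written, and therefore physically placed, on the NUMA node of the thread that will work on it.

#include <omp.h>

/* Touch the arrays in parallel with the same schedule as the compute loop,
   so each memory page is first touched (and thus physically allocated) on
   the NUMA node of the thread that will later read and write it. */
void first_touch_init(double *x, double *y, long n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) {
        x[i] = 1.0;
        y[i] = 1.0;
    }
}

The compute loop should then also use schedule(static) over the same index range, so that each thread mostly accesses memory local to its own node.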