Tags: c, linux, virtual-memory, tlb

Using 1GB pages degrades performance


I have an application where I need about 850 MB of contiguous memory and will be accessing it in a random manner. It was suggested that I allocate one huge page of 1 GB, so that it would always stay in the TLB. I've written a demo with sequential/random accesses to measure the performance of small (4 KB in my case) vs. large (1 GB) pages:

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT) // Not used in this example.
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#define MESSINESS_LEVEL 512 // Poisons caches if LRU policy is used.

#define RUN_TESTS 25

void print_usage() {
  printf("Usage: ./program small|huge1gb sequential|random\n");
}

int main(int argc, char *argv[]) {
  if (argc != 3 && argc != 4) {
    print_usage();
    return -1;
  }
  uint64_t size = 1UL * 1024 * 1024 * 1024; // 1GB
  uint32_t *ptr;
  if (strcmp(argv[1], "small") == 0) {
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, // basically malloc(size);
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr == MAP_FAILED) {
      perror("mmap small");
      exit(1);
    }
  } else if (strcmp(argv[1], "huge1gb") == 0) {
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);
    if (ptr == MAP_FAILED) {
      perror("mmap huge1gb");
      exit(1);
    }
  } else {
    print_usage();
    return -1;
  }

  clock_t start_time, end_time;
  start_time = clock();

  if (strcmp(argv[2], "sequential") == 0) {
    for (int iter = 0; iter < RUN_TESTS; iter++) {
      for (uint64_t i = 0; i < size / sizeof(*ptr); i++)
        ptr[i] = i * 5;
    }
  } else if (strcmp(argv[2], "random") == 0) {
    // pseudorandom access pattern, defeats caches.
    uint64_t index;
    for (int iter = 0; iter < RUN_TESTS; iter++) {
      for (uint64_t i = 0; i < size / MESSINESS_LEVEL / sizeof(*ptr); i++) {
        for (uint64_t j = 0; j < MESSINESS_LEVEL; j++) {
          index = i + j * size / MESSINESS_LEVEL / sizeof(*ptr);
          ptr[index] = index * 5;
        }
      }
    }
  } else {
    print_usage();
    return -1;
  }

  end_time = clock();
  long double duration = (long double)(end_time - start_time) / CLOCKS_PER_SEC;
  printf("Avr. Duration per test: %Lf\n", duration / RUN_TESTS);
  //  write(1, ptr, size); // Dumps memory content (1GB to stdout).
}

And on my machine (more below) the results are:

Sequential:

$ ./test small sequential
Avr. Duration per test: 0.562386
$ ./test huge1gb sequential        <--- slightly better
Avr. Duration per test: 0.543532

Random:

$ ./test small random              <--- better
Avr. Duration per test: 2.911480
$ ./test huge1gb random
Avr. Duration per test: 6.461034

I'm bothered by the random test: it seems that the 1GB page is 2 times slower! I tried using madvise with MADV_SEQUENTIAL / MADV_RANDOM for the respective tests (a minimal sketch is below); it didn't help.
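
For reference, this is roughly how I applied the advice to the mapping right after mmap (a sketch; which constant is passed depends on the test):

    // Hint the kernel about the expected access pattern:
    // MADV_RANDOM for the random test, MADV_SEQUENTIAL for the sequential one.
    if (madvise(ptr, size, MADV_RANDOM) != 0)
      perror("madvise");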

Why does using one huge page degrade performance in the case of random accesses? What are the use cases for huge pages (2MB and 1GB) in general?

I didn't test this code with 2MB pages; I think it would probably do better. I also suspect that, since a 1GB page is stored in one memory bank, the slowdown may have something to do with multi-channel memory. But I would like to hear from you folks. Thanks.

Note: to run the test you must first enable 1GB pages in your kernel. You can do that by passing the kernel these parameters: hugepagesz=1G hugepages=1 default_hugepagesz=1G. More: https://wiki.archlinux.org/index.php/Kernel_parameters. If enabled, you should see something like:

$ cat /proc/meminfo | grep Huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:    1048576 kB
Hugetlb:         1048576 kB
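
If you prefer not to reboot, a 1GB page can sometimes also be reserved at runtime through sysfs, though this may fail once physical memory gets fragmented (reserving it at boot, as above, is more reliable):

# echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages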

EDIT1: My machine has a Core i5-8600 and 4 memory banks of 4 GB each. The CPU natively supports both 2MB and 1GB pages (it has the pse & pdpe1gb flags, see: https://wiki.debian.org/Hugepages#x86_64). I was measuring machine time, not CPU time; I updated the code, and the results are now the average of 25 tests.
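
One way to check for those flags (GNU grep):

$ grep -o -w -e pse -e pdpe1gb /proc/cpuinfo | sort -u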

I was also told that this test does better with 2MB pages than with normal 4KB ones.


Solution

  • Intel was kind enough to reply to this issue. See their answer below.


    This issue is due to how physical pages are actually committed. In the case of 1GB pages, the memory is contiguous, so as soon as you write to any one byte within the 1GB page, the entire 1GB page is assigned. However, with 4KB pages, the physical pages get allocated as and when you touch each of the 4KB pages for the first time.

    for (uint64_t i = 0; i < size / MESSINESS_LEVEL / sizeof(*ptr); i++) {
      for (uint64_t j = 0; j < MESSINESS_LEVEL; j++) {
        index = i + j * size / MESSINESS_LEVEL / sizeof(*ptr);
        ptr[index] = index * 5;
      }
    }
    

    In the innermost loop, the index changes at a stride of 512K elements, so consecutive references map at 2MB offsets. Typically caches have 2048 sets (which is 2^11). So, bits 6:16 select the sets. But if you stride at such large power-of-two offsets, bits 6:16 are always the same, ending up selecting the same set and losing the spatial locality (see the worked example after this reply).

    We would recommend initializing the entire 1GB buffer sequentially (in the small-page test), as below, before starting the clock to time it:

    for (uint64_t i = 0; i < size / sizeof(*ptr); i++)
        ptr[i] = i * 5;
    

    Basically, the issue is set conflicts, which result in cache misses with huge pages compared to small pages, because of the very large constant offsets. When you use constant offsets, the test is really not random (see the sketch after this reply).
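
To make the set-index arithmetic concrete, here is a small standalone sketch (my addition, not part of Intel's reply, assuming a 64-byte cache line and 2048 sets as above). It prints the set selected by the first few inner-loop references; every offset is a multiple of 2MB, so they all land in set 0. With a 1GB huge page the low 30 bits of the physical address equal the offset into the buffer, so the conflicts carry over to the physically indexed caches, whereas scattered 4KB frames differ in bits 12 and above and spread the references across sets.

    #include <inttypes.h>
    #include <stdio.h>

    #define MESSINESS_LEVEL 512

    int main(void) {
      uint64_t size = 1UL * 1024 * 1024 * 1024;     // 1GB buffer, as in the test
      uint64_t stride = size / MESSINESS_LEVEL / 4; // inner-loop index stride, in uint32_t elements
      for (uint64_t j = 0; j < 8; j++) {
        uint64_t offset = j * stride * 4;           // byte offset of ptr[index] within the buffer
        uint64_t set = (offset >> 6) & 0x7FF;       // drop 6 line-offset bits, keep 11 set bits (6:16)
        printf("j=%" PRIu64 " offset=%" PRIu64 " MB set=%" PRIu64 "\n", j, offset >> 20, set);
      }
      return 0;
    }

And here is a hedged sketch (again my own suggestion, not Intel's code) of how the random test could be restructured to avoid both effects: pre-fault the buffer before starting the clock, as recommended above (for the small-page case, MAP_POPULATE on the mmap call has a similar effect), and replace the constant power-of-two stride with a full-period LCG walk over the index space so consecutive references no longer alias to the same cache set:

    // Assumes ptr and size are set up exactly as in the question.
    uint64_t n = size / sizeof(*ptr); // 2^28 elements; n is a power of two

    // Touch every page up front so physical allocation is not timed.
    for (uint64_t i = 0; i < n; i++)
      ptr[i] = i * 5;

    start_time = clock();
    for (int iter = 0; iter < RUN_TESTS; iter++) {
      // Full-period LCG over [0, n): visits each index exactly once per pass,
      // in pseudorandom order with no constant stride (modulus is a power of
      // two, multiplier is 1 mod 4, increment is odd).
      uint64_t index = 0;
      for (uint64_t i = 0; i < n; i++) {
        index = (1664525 * index + 1013904223) & (n - 1);
        ptr[index] = index * 5;
      }
    }
    end_time = clock();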