Search code examples
cassemblygccx86

Why GCC generates a mov of the array's beginning on every loop iteration to access array using []? (-O3, x86)


Description

I created a sample to study the TLB access/misses statistics. Sample writes 1 to every 4096-th element of the array. Array has 10'000 * 4096 bytes. I expect to see 10'000 TLB stores only, but generated assembly loads beginning of the array every iteration, resulting in 10'000 TLB loads in addition to stores. -O3 optimization is applied

When I looked into the assembly, I noticed that the for-loop looks like that:

  1. move beginning of the array to the register
  2. set to 1 a shifted beginning of the array
  3. increase index
  4. jump to step 1

Question: Why step 1 is executed every single iteration? Beginning of the array is not changing. I expect the beginning to be loaded once and the jump to be to step 2

C code

(main just calls this test_function 10K times):

#include <stdlib.h>

#define REPEAT 10000
#define PAGESIZE 4096
#define PAGES 10000

char *data = NULL;

inline void test_function()
{
    for (int i = 0; (i < PAGES * PAGESIZE); i += (PAGESIZE)) {
        data[i] = 1;
    }
}


int main()
{
    data = (char *) malloc(PAGES * PAGESIZE);
    for (int idx = 0; idx < REPEAT; idx++)
        test_function();
    free(data);
    return 0;
}

Generated assembly with gcc and -O3

    1090:       mov    rdx,QWORD PTR [rip+0x2f81]        # 4018 <data>
    1097:       mov    BYTE PTR [rdx+rax*1],0x1
    109b:       add    rax,0x1000
    10a1:       cmp    rax,0x2710000
    10a7:       jne    1090 <main+0x30>

perf stats for 100'000 repetitions

Per function call we can see:

  • 10K L1 cache loads, 10K L1 cache stores
  • 10K TLB loads, 10K TLB stores
  • ~0 TLB load misses, 10K TLB store misses So load of the array's beginning is always cached in TLB, but it's still accessed. Why?
1002500646      L1-dcache-load:u          (66.50%)
 998346580      L1-dcache-stores:u        (66.68%)
 995750153      dTLB-loads:u              (66.81%)
 998020884      dTLB-stores:u             (66.74%)
     34586      dTLB-loads-misses:u       (66.68%)
1003649391      dTLB-stores-misses:u      (66.58%)

Platform

Intel(R) Core(TM) i7-10610U CPU

Ubuntu 22.04.3 LTS

g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Compilation line: g++ ./tlb.cpp -O3 -g -o gcc.out


Solution

  • The code does not know what data holds when the code runs. Therefore, it is conceivable that data[i] overlays data:

    data = (char *)&data;
    

    when this happens, assigning to data[i] could modify data itself. Thus the compiler cannot simply assume they do not alias and must reload data after each access. Granted, with some more code it could fix this, but the compiler writers have not implemented that.

    To avoid this problem, make sure to cache global variables in local variables if this sort of problem could occur:

    inline void test_function()
    {
        char *d;
    
        d  = data;
        for (int i = 0; (i < PAGES * PAGESIZE); i += (PAGESIZE)) {
            d[i] = 1;
        }
    }