Tags: c, performance, profiling, cpu-cache, cachegrind

How can I pinpoint if the slowness in my program is a CPU cache issue (on Linux)?


I'm currently trying to understand some very strange behavior in one of my C programs. Apparently, adding or removing a seemingly inconsequential line at the end of it drastically affects the performance of the rest of the program.

My program looks a bit like this:

#include <stdio.h>

int large_buffer[10000];

void compute(FILE * input) {
    for(int i=0; i<100; i++) {
        do_lots_of_stuff();
        printf(".");
        fflush(stdout);
    }
}

int main() {
    FILE *input = fopen("input.txt", "r");
    compute(input);
    fclose(input); // <--- everything gets 2x slower if I comment this out (!)
    return 0;
}

In theory, that fclose(input) line at the end of the main function should not matter, since the OS automatically closes the file at the end of the program anyway. However, I observed that my program took 2.5 seconds to run when I included the fclose statement and 5 seconds when I commented it out. A factor of 2 difference! And this was not due to latency at the start or end of the program: the rate at which the dots are printed out is visibly faster in the version with the fclose statement.

I suspect that this might have to do with some memory alignment or cache miss issue. If I replace the fclose with another function such as ftell, it also takes 5 seconds to run, and if I reduce the size of large_buffer to <= 8000 elements, it always runs in 2.5 seconds, regardless of whether the fclose statement is there or not.

But I would really like to be 100% sure of what the culprit behind this strange behavior is. Would it be possible to run my program under some sort of profiler or other tool that would give me that information? So far I have tried running both versions under valgrind --tool=cachegrind, but it reported the same number of cache misses (0%) for both versions of my program.


Edit 1: After running both versions of my program under perf stat -d -d -d I got the following results:

 Performance counter stats for './no-fclose examples/bench.o':

       5625.535086      task-clock (msec)         #    1.000 CPUs utilized          
                38      context-switches          #    0.007 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                54      page-faults               #    0.010 K/sec                  
    17,851,853,580      cycles                    #    3.173 GHz                      (53.23%)
     6,421,955,412      stalled-cycles-frontend   #   35.97% frontend cycles idle     (53.23%)
     4,919,383,925      stalled-cycles-backend    #   27.56% backend cycles idle      (53.23%)
    13,294,878,129      instructions              #    0.74  insn per cycle         
                                                  #    0.48  stalled cycles per insn  (59.91%)
     3,178,485,061      branches                  #  565.010 M/sec                    (59.91%)
       440,171,927      branch-misses             #   13.85% of all branches          (59.92%)
     4,778,577,556      L1-dcache-loads           #  849.444 M/sec                    (60.19%)
           125,313      L1-dcache-load-misses     #    0.00% of all L1-dcache hits    (60.22%)
            12,110      LLC-loads                 #    0.002 M/sec                    (60.25%)
   <not supported>      LLC-load-misses                                             
   <not supported>      L1-icache-loads                                             
        20,196,491      L1-icache-load-misses                                         (60.22%)
     4,793,012,927      dTLB-loads                #  852.010 M/sec                    (60.18%)
               683      dTLB-load-misses          #    0.00% of all dTLB cache hits   (60.13%)
             3,443      iTLB-loads                #    0.612 K/sec                    (53.38%)
                90      iTLB-load-misses          #    2.61% of all iTLB cache hits   (53.31%)
   <not supported>      L1-dcache-prefetches                                        
            51,382      L1-dcache-prefetch-misses #    0.009 M/sec                    (53.24%)

       5.627225926 seconds time elapsed

 Performance counter stats for './yes-fclose examples/bench.o':

       2652.609254      task-clock (msec)         #    1.000 CPUs utilized          
                15      context-switches          #    0.006 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                57      page-faults               #    0.021 K/sec                  
     8,277,447,108      cycles                    #    3.120 GHz                      (53.39%)
     2,453,171,903      stalled-cycles-frontend   #   29.64% frontend cycles idle     (53.46%)
     1,235,728,409      stalled-cycles-backend    #   14.93% backend cycles idle      (53.53%)
    13,296,127,857      instructions              #    1.61  insn per cycle         
                                                  #    0.18  stalled cycles per insn  (60.20%)
     3,177,698,785      branches                  # 1197.952 M/sec                    (60.20%)
        71,034,122      branch-misses             #    2.24% of all branches          (60.20%)
     4,790,733,157      L1-dcache-loads           # 1806.046 M/sec                    (60.20%)
            74,908      L1-dcache-load-misses     #    0.00% of all L1-dcache hits    (60.20%)
            15,289      LLC-loads                 #    0.006 M/sec                    (60.19%)
   <not supported>      LLC-load-misses                                             
   <not supported>      L1-icache-loads                                             
           140,750      L1-icache-load-misses                                         (60.08%)
     4,792,716,217      dTLB-loads                # 1806.793 M/sec                    (59.93%)
             1,010      dTLB-load-misses          #    0.00% of all dTLB cache hits   (59.78%)
               113      iTLB-loads                #    0.043 K/sec                    (53.12%)
               167      iTLB-load-misses          #  147.79% of all iTLB cache hits   (53.44%)
   <not supported>      L1-dcache-prefetches                                        
            29,744      L1-dcache-prefetch-misses #    0.011 M/sec                    (53.36%)

       2.653584624 seconds time elapsed

Looks like there were few data-cache misses in both cases, as kcachegrind had reported, but the slower version of the program had worse branch prediction and more instruction cache misses and iTLB loads. Which of these differences would be the most likely to be responsible for the 2x difference in runtime between the test cases?


Edit 2: Fun fact: apparently the strange behavior persists if I replace the fclose call with a single NOP instruction.
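
(For concreteness, the kind of replacement I mean is roughly the following; this is just a sketch, assuming a GCC-style inline-assembly NOP, whether done in the C source or directly in the generated assembly:)

int main() {
    FILE *input = fopen("input.txt", "r");
    compute(input);
    __asm__ volatile("nop"); /* stands in for the fclose(input) call */
    return 0;
}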


Edit 3: My processor is an Intel i5-2310 (Sandy Bridge).


Edit 4: It turns out that if I resize the arrays by editing the assembly file, the program does not get faster. Apparently the reason it got faster when I changed their sizes in the C code was that gcc decided to rearrange the order of things in the binary.


Edit 5: More evidence that what matters is the precise addresses of the JMP instructions: if I add a single NOP (instead of a printf) at the start of my code, it gets faster. Similarly, if I remove a useless instruction from the start of my code, it also gets faster. And when I compiled my code with a different version of gcc it also got faster, despite the generated assembly code being the same. The only differences were the debug info at the start and that the sections of the binary file were in a different order.


Solution

  • The Key Metric

    Your culprit is branch misses:

    440,171,927      branch-misses             #   13.85% of all branches
    

    vs.

    71,034,122      branch-misses             #    2.24% of all branches
    

    I'm not sure which processor you're running, but if you were hypothetically running a 2.5 GHz Haswell processor, each branch prediction miss costs around 15 cycles (usually a bit more, because other things stall too), and each cycle is 0.4 ns. So, 0.4 ns/cycle * 15 cycles/missed branch * (440,171,927 - 71,034,122) branch misses ≈ 2.2 seconds. It will depend on your exact processor and the machine code, but this explains the bulk of the difference right there.

    Cause

    The branch prediction algorithms of different chips are proprietary, but if you read Agner Fog's microarchitecture guide (http://www.agner.org/optimize/microarchitecture.pdf) you can learn more about different processors and their limitations. Essentially, what you'll find is that certain processors do a better job of avoiding collisions in the branch prediction tables they maintain to predict whether branches are taken or not.

    So, why is that relevant? Well, what has happened (99% chance) is that by rearranging your program very slightly, you change exactly where the different branches sit in memory. Too many of them end up mapping to the same bucket in the processor's branch prediction tables, so the history recorded for one branch keeps getting clobbered by another. By changing the executable slightly, this problem goes away. The two branch points have to be a very specific distance apart to trigger this issue, and you have unluckily managed to do exactly that.
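
    To make the aliasing idea concrete, here is a toy model (my own simplification; real indexing schemes are proprietary and more sophisticated): imagine the predictor table is indexed by the low-order bits of each branch's address, so two hot branches whose addresses happen to differ by a multiple of the table span land on the same entry.

    #include <stdio.h>
    #include <stdint.h>

    /* Toy model of predictor-table aliasing (illustrative only). */
    #define TABLE_ENTRIES 4096

    static unsigned toy_index(uintptr_t branch_addr) {
        return (unsigned)(branch_addr & (TABLE_ENTRIES - 1));
    }

    int main(void) {
        uintptr_t branch_a = 0x401234;          /* hypothetical branch addresses */
        uintptr_t branch_b = branch_a + 0x1000; /* exactly one table span apart  */

        /* Both map to the same entry, so their histories clobber each other
           and predictions degrade. Shifting either address by a few bytes
           breaks the collision. */
        printf("branch_a -> entry %u\n", toy_index(branch_a));
        printf("branch_b -> entry %u\n", toy_index(branch_b));
        return 0;
    }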

    Simple Solution

    As you found, many changes will in fact cause the program not to hit this degraded performance. Essentially, anything that changes the distance between the two critical branch points will fix the problem. You can accomplish this by literally inserting 16 bytes of machine-code NOPs (or enough to move the branch points to different buckets) somewhere between the two places, as sketched below. You can also do as you did and change something that shifts this distance into a non-pathological case.
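
    For example, with GCC or Clang on x86-64 (an assumption about your toolchain), the padding can be injected with inline assembly, or you can let the compiler re-align the code for you:

    /* Sketch: shift the addresses of everything that follows by emitting
       16 one-byte NOPs at this point (GCC/Clang inline assembly, x86). */
    static inline void nop_padding(void) {
        __asm__ volatile(
            ".rept 16\n\t"
            "nop\n\t"
            ".endr"
        );
    }

    /* Alternatively, recompiling with flags such as -falign-functions=32 or
       -falign-jumps=32 changes code placement without touching the source. */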

    Digging Deeper

    If you want to truly understand what causes it in this situation, you'll need to get your hands dirty. Fun! You will need to select one of many tools to find the specific branch that is being mispredicted. Here is one way: How to measure mispredictions for a single branch on Linux?
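
    As a sketch of the do-it-yourself route (assuming Linux's perf_event_open interface; the structure below is my own, not code from the linked answer), you can count branch misses around just the region you suspect and bisect from there:

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Thin wrapper; glibc does not provide one for perf_event_open. */
    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags) {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_BRANCH_MISSES;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
        if (fd == -1) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... region under suspicion, e.g. the hot loop in compute() ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t branch_misses = 0;
        read(fd, &branch_misses, sizeof(branch_misses));
        printf("branch misses in region: %llu\n",
               (unsigned long long)branch_misses);
        close(fd);
        return 0;
    }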

    After you identify the mispredicted branch, you can figure out if there is a way to remove the branch entirely (almost always a good idea for performance anyway). If not, you can place a hint on it or, in the worst case, just move things around to ensure that the same entry isn't shared, as previously suggested.
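
    A "hint" here usually means __builtin_expect with GCC or Clang (or the [[likely]]/[[unlikely]] attributes in newer C++). Strictly speaking it steers the compiler's code layout rather than the hardware predictor, but that layout change is exactly the kind of shift that matters in your case. A sketch, with a hypothetical condition name:

    /* Sketch: compiler-level branch hints (GCC/Clang). Substitute the branch
       you actually identified as mispredicted for this placeholder. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    void process(int rare_condition) {
        if (unlikely(rare_condition)) {
            /* cold, rarely-taken path gets moved out of the hot code */
        } else {
            /* hot path stays on the fall-through side */
        }
    }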

    Broader lessons

    Programmers underestimate the cost of branching (when the compiler is unable to remove the branches at compile time). As you've discovered, each branch puts more strain on a processor's branch prediction buffers and increases the chance of mispredictions. So even branches that are 100% predictable to the processor can degrade performance by reducing the resources available for predicting other branches. Further, when a branch is mispredicted, it costs a minimum of 12-15 cycles, and can cost much more if the needed instructions aren't in the L1 cache, L2 cache, L3 cache, or, heaven help you, main memory. Additionally, the compiler cannot perform all of its optimizations across branches.
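
    As one illustration of removing a branch (a generic sketch, not code from your program): an unpredictable, data-dependent branch can often be rewritten as branch-free arithmetic, which compilers will typically turn into a conditional move.

    /* Branchy version: mispredicts badly when (x > limit) is unpredictable. */
    int clamp_branchy(int x, int limit) {
        if (x > limit)
            return limit;
        return x;
    }

    /* Branch-free version: select with a mask instead of a conditional jump.
       (Modern compilers often emit a cmov for either form, but writing it
       this way makes the intent explicit.) */
    int clamp_branchless(int x, int limit) {
        int over = -(x > limit);             /* all-ones when x > limit, else 0 */
        return (x & ~over) | (limit & over);
    }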