I'm currently trying to understand some very very strange behavior in one of my C programs. Apparently, adding or removing a seemingly inconsequential line at the end of it drastically affects the performance in the rest of the program.
My program looks a bit like this:
```c
int large_buffer[10000];

void compute(FILE * input) {
    for (int i = 0; i < 100; i++) {
        do_lots_of_stuff();
        printf(".");
        fflush(stdout);
    }
}

int main() {
    FILE *input = fopen("input.txt", "r");
    compute(input);
    fclose(input); // <--- everything gets 2x slower if I comment this out (!)
    return 0;
}
```
In theory, that `fclose(input)` line at the end of the main function should not matter, since the OS automatically closes the file at the end of the program anyway. However, I observed that my program took 2.5 seconds to run when I included the fclose statement and 5 seconds when I commented it out. A factor of 2 difference! And this was not due to latency at the start or end of the program: the speed at which the dots are printed out is visibly faster in the version with the fclose statement.
I suspect that this might have to do with some memory alignment or cache miss issue. If I replace the fclose with another function such as ftell, the program also takes 5 seconds to run, and if I reduce the size of `large_buffer` to 8000 elements or fewer, it always runs in 2.5 seconds, regardless of whether the fclose statement is there or not.
But I would really like to be 100% sure of what the culprit behind this strange behavior is. Would it be possible to run my program under some sort of profiler or other tool that would give me that information? So far I have tried running both versions under `valgrind --tool=cachegrind`, but it reported the same amount of cache misses (0%) for both versions of my program.
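For reference, cachegrind's branch-prediction simulator is off by default, which is why the report only covered cache behavior. An invocation that also counts simulated branch mispredictions would look like this (a sketch, using the binary names from the perf runs below):

```shell
# --branch-sim=yes adds conditional and indirect branch misprediction
# counts (Bc/Bcm/Bi/Bim) to the cachegrind report
valgrind --tool=cachegrind --branch-sim=yes ./no-fclose examples/bench.o
```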
edit 1: After running both versions of my program under `perf stat -d -d -d`, I got the following results:
```
 Performance counter stats for './no-fclose examples/bench.o':

       5625.535086      task-clock (msec)         #    1.000 CPUs utilized
                38      context-switches          #    0.007 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                54      page-faults               #    0.010 K/sec
    17,851,853,580      cycles                    #    3.173 GHz                      (53.23%)
     6,421,955,412      stalled-cycles-frontend   #   35.97% frontend cycles idle     (53.23%)
     4,919,383,925      stalled-cycles-backend    #   27.56% backend cycles idle      (53.23%)
    13,294,878,129      instructions              #    0.74  insn per cycle
                                                  #    0.48  stalled cycles per insn  (59.91%)
     3,178,485,061      branches                  #  565.010 M/sec                    (59.91%)
       440,171,927      branch-misses             #   13.85% of all branches          (59.92%)
     4,778,577,556      L1-dcache-loads           #  849.444 M/sec                    (60.19%)
           125,313      L1-dcache-load-misses     #    0.00% of all L1-dcache hits    (60.22%)
            12,110      LLC-loads                 #    0.002 M/sec                    (60.25%)
   <not supported>      LLC-load-misses
   <not supported>      L1-icache-loads
        20,196,491      L1-icache-load-misses                                         (60.22%)
     4,793,012,927      dTLB-loads                #  852.010 M/sec                    (60.18%)
               683      dTLB-load-misses          #    0.00% of all dTLB cache hits   (60.13%)
             3,443      iTLB-loads                #    0.612 K/sec                    (53.38%)
                90      iTLB-load-misses          #    2.61% of all iTLB cache hits   (53.31%)
   <not supported>      L1-dcache-prefetches
            51,382      L1-dcache-prefetch-misses #    0.009 M/sec                    (53.24%)

       5.627225926 seconds time elapsed
```
```
 Performance counter stats for './yes-fclose examples/bench.o':

       2652.609254      task-clock (msec)         #    1.000 CPUs utilized
                15      context-switches          #    0.006 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                57      page-faults               #    0.021 K/sec
     8,277,447,108      cycles                    #    3.120 GHz                      (53.39%)
     2,453,171,903      stalled-cycles-frontend   #   29.64% frontend cycles idle     (53.46%)
     1,235,728,409      stalled-cycles-backend    #   14.93% backend cycles idle      (53.53%)
    13,296,127,857      instructions              #    1.61  insn per cycle
                                                  #    0.18  stalled cycles per insn  (60.20%)
     3,177,698,785      branches                  # 1197.952 M/sec                    (60.20%)
        71,034,122      branch-misses             #    2.24% of all branches          (60.20%)
     4,790,733,157      L1-dcache-loads           # 1806.046 M/sec                    (60.20%)
            74,908      L1-dcache-load-misses     #    0.00% of all L1-dcache hits    (60.20%)
            15,289      LLC-loads                 #    0.006 M/sec                    (60.19%)
   <not supported>      LLC-load-misses
   <not supported>      L1-icache-loads
           140,750      L1-icache-load-misses                                         (60.08%)
     4,792,716,217      dTLB-loads                # 1806.793 M/sec                    (59.93%)
             1,010      dTLB-load-misses          #    0.00% of all dTLB cache hits   (59.78%)
               113      iTLB-loads                #    0.043 K/sec                    (53.12%)
               167      iTLB-load-misses          #  147.79% of all iTLB cache hits   (53.44%)
   <not supported>      L1-dcache-prefetches
            29,744      L1-dcache-prefetch-misses #    0.011 M/sec                    (53.36%)

       2.653584624 seconds time elapsed
```
Looks like there were few data-cache misses in both cases, just as cachegrind had reported, but the slower version of the program had worse branch prediction and more instruction cache misses and iTLB loads. Which of these differences would be the most likely to be responsible for the 2x difference in runtime between the test cases?
edit 2: Fun fact: apparently I can still keep the strange behavior if I replace the fclose call with a single NOP instruction.
edit 3: My processor is an Intel i5-2310 (Sandy Bridge).
edit 4: Turns out that if I resize the arrays by editing the assembly file, the program does not get faster. Apparently, the reason it got faster when I changed their sizes in the C code was that gcc decided to rearrange the order of things in the binary.
edit 5: More evidence that what matters is the precise addresses of the JMP instructions: if I add a single NOP (instead of a printf) at the start of my code, it gets faster. Similarly, if I remove a useless instruction from the start of my code, it also gets faster. And when I compiled my code with a different version of gcc, it also got faster, despite the generated assembly code being the same. The only differences were the debug info at the start and that the sections of the binary file were in a different order.
The Key Metric
Your culprit is branch misses:

```
440,171,927 branch-misses    # 13.85% of all branches
```

vs.

```
 71,034,122 branch-misses    #  2.24% of all branches
```
I'm not sure which processor you're running, but if you were hypothetically running a 2.5 GHz Haswell, each branch-prediction miss would cost around 15 cycles (usually a bit more, because other stuff stalls too), and each cycle is 0.4 ns. So 0.4 ns/cycle * 15 cycles/missed branch * (440,171,927 - 71,034,122) branch misses = 2.2 seconds. It will depend on your exact processor and the machine code, but this explains the bulk of the difference right there.
Cause
The branch prediction algorithms of different chips are proprietary, but if you read Agner Fog's microarchitecture guide (http://www.agner.org/optimize/microarchitecture.pdf) you can learn more about different processors and their limitations. Essentially, what you'll find is that certain processors do a better job than others of avoiding collisions in the branch-prediction tables they store to make predictions about whether branches are taken or not.
So, why is that relevant? Well, what has happened (99% chance) is that by rearranging your program very slightly, you changed exactly where the different branches sit in memory. When too many branches map to the same bucket in the processor's branch-prediction tables, the predictions interfere with each other. By changing the executable slightly, this problem goes away. You have to have a very specific distance between the two branch points to trigger this issue, and you have unluckily managed to do exactly that.
Simple Solution
As you found, many changes will in fact cause the program to avoid this degraded performance. Essentially, anything that changes the distance between the two critical branch points will fix the problem. You can accomplish this by literally inserting 16 bytes (or enough to move the branch points into different buckets) of machine-code NOPs somewhere between the two places. You can also do as you did and change something else that moves this distance to a non-pathological value.
Digging Deeper
If you want to truly understand what causes it in this situation, you'll need to get your hands dirty. Fun! You will need to select one of many tools to find the specific branch that is being mispredicted. Here is one way: How to measure mispredictions for a single branch on Linux?
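One concrete approach on Linux, assuming perf is available (the `:pp` modifier requests precise sampling so samples land on or near the faulting instruction on Intel chips):

```shell
# Sample the branch-misses event across the run
perf record -e branch-misses:pp ./no-fclose examples/bench.o
# Browse the samples, annotated against the disassembly, to find
# which branch instruction accumulated the misses
perf report
```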
After you identify the mispredicted branch, you can figure out whether there is a way to remove the branch entirely (almost always a good idea for performance anyway). If not, you can place a hint for it or, worst case, just move things around so that the same prediction-table entry isn't shared, as previously suggested.
Broader lessons
Programmers underestimate the cost of branching (when the compiler is unable to remove the branches at compile time). As you've discovered, each branch puts more strain on a processor's branch-prediction buffers and increases the chance of mispredictions. So even branches that are 100% predictable to the processor can degrade performance by reducing the resources available for predicting other branches. Further, when a branch is mispredicted, it costs a minimum of 12-15 cycles, and can cost much more if the needed instructions aren't in the L1, L2, or L3 cache or, heaven help you, main memory. Additionally, the compiler can't make all optimizations across branches.