Tags: c++, performance, gcc, optimization, inline-assembly

Inline assembly array sum benchmark near-zero time for large arrays with optimization enabled, even though result is used


I have written two functions that compute the sum of an array: the first is written in C++ and the other with inline assembly (x86-64). I compared the performance of the two functions on my device.

  • If the -O flag is not enabled during compilation, the function with inline assembly is almost 3x faster than the C++ version.

    cpp time : 543070068 nanoseconds
    cpp time : 547990578 nanoseconds
    
    asm time : 185495494 nanoseconds
    asm time : 188597476 nanoseconds
    
  • If the -O flag is set to -O1, they produce roughly the same performance.

    cpp time : 177510914 nanoseconds
    cpp time : 178084988 nanoseconds
    
    asm time : 179036546 nanoseconds
    asm time : 181641378 nanoseconds
    
  • But if I set the -O flag to -O2 or -O3, I get an unusual 2-3 digit nanosecond reading for the function written with inline assembly, which seems suspiciously fast (at least to me; please bear with me, since I have no rock-solid experience with assembly programming, so I don't know how fast or slow it can be compared to a program written in C++).

    cpp time : 177522894 nanoseconds
    cpp time : 183816275 nanoseconds
    
    asm time : 125 nanoseconds
    asm time : 75 nanoseconds
    

My Questions

  • Why is this array sum function written with inline assembly so fast after enabling -O2 or -O3?

  • Is this a normal reading, or is there something wrong with the timing/measurement of the performance?

  • Or maybe there is something wrong with my inline assembly function?

  • And if the inline assembly function for the array sum is correct and the performance reading is correct, why does the C++ compiler fail to optimize a simple array sum for the C++ version and make it as fast as the inline assembly version?

I have also speculated that maybe memory alignment and cache misses are improved during compilation to increase the performance, but my knowledge on this is still very limited.

Apart from answering my questions, if you have anything to add please feel free to do so. I hope somebody can explain, thanks!


[EDIT]

So I have removed the use of the macro, run the two versions in isolation, and also tried adding the volatile keyword, a "memory" clobber, and a "+&r" constraint for the output; the performance is now the same as cpp_sum.

Though if I remove the volatile keyword and the "memory" clobber again, I'm still getting those 2-3 digit nanosecond readings.

code:

#include <iostream>
#include <random>
#include <chrono>
#include <cstdint>   // uint64_t
#include <cstddef>   // size_t

uint64_t sum_cpp(const uint64_t *numbers, size_t length) {
    uint64_t sum = 0;
    for(size_t i=0; i<length; ++i) {
        sum += numbers[i];
    }
    return sum;
}

uint64_t sum_asm(const uint64_t *numbers, size_t length) {
    uint64_t sum = 0;
    asm volatile(
        "xorq %%rax, %%rax\n\t"
        "%=:\n\t"
        "addq (%[numbers], %%rax, 8), %[sum]\n\t"
        "incq %%rax\n\t"
        "cmpq %%rax, %[length]\n\t"
        "jne %=b"
        : [sum]"+&r"(sum)
        : [numbers]"r"(numbers), [length]"r"(length)
        : "%rax", "memory", "cc"
    );
    return sum;
}

int main() {
    std::mt19937_64 rand_engine(1);
    std::uniform_int_distribution<uint64_t> random_number(0,5000);

    size_t length = 99999999;
    uint64_t *arr = new uint64_t[length];
    for(size_t i=0; i<length; ++i) arr[i] = random_number(rand_engine);

    uint64_t cpp_total = 0, asm_total = 0;

    for(size_t i=0; i<5; ++i) {
        auto start = std::chrono::high_resolution_clock::now();
#ifndef _INLINE_ASM
        cpp_total += sum_cpp(arr, length);
#else
        asm_total += sum_asm(arr,length);
#endif
        auto end = std::chrono::high_resolution_clock::now();
        auto dur = std::chrono::duration_cast<std::chrono::nanoseconds>(end-start);
        std::cout << "time : " << dur.count() << " nanoseconds\n";
    }

#ifndef _INLINE_ASM
    std::cout << "cpp sum = " << cpp_total << "\n";
#else
    std::cout << "asm sum = " << asm_total << "\n";
#endif

    delete [] arr;
    return 0;
}

Solution

  • The compiler is hoisting the inline asm out of your repeat loop, and thus out of your timed region.

    If your goal is performance, https://gcc.gnu.org/wiki/DontUseInlineAsm. The useful thing to spend your time learning first is SIMD intrinsics (and how they compile to asm) like _mm256_add_epi64 to add 4x uint64_t with a single AVX2 instruction. See https://stackoverflow.com/tags/sse/info (Compilers can auto-vectorize decently for a simple sum like this, which you could see the benefit from if you used a smaller array and put a repeat loop inside the timed region to get some cache hits.)
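
    If it helps make that concrete, here is a minimal sketch of the kind of loop _mm256_add_epi64 enables (sum_avx2 is just an illustrative name; it assumes AVX2 is available, you compile with -mavx2, and length is a multiple of 4, so a real version would need a scalar tail):

    #include <immintrin.h>
    #include <cstdint>
    #include <cstddef>

    // Sketch only: 4x uint64_t per vector add, with a horizontal reduction at the end.
    uint64_t sum_avx2(const uint64_t *numbers, size_t length) {
        __m256i vsum = _mm256_setzero_si256();
        for (size_t i = 0; i < length; i += 4) {
            // Unaligned 32-byte load of 4 elements, then one AVX2 vector add.
            __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(numbers + i));
            vsum = _mm256_add_epi64(vsum, v);
        }
        alignas(32) uint64_t lanes[4];
        _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), vsum);
        return lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }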

    If you want to play around with asm to test what's actually fast on various CPUs, you can do that in a stand-alone static executable, or a function you call from C++. https://stackoverflow.com/tags/x86/info has some good performance links.

    Re: benchmarking at -O0: yes, the compiler makes slow asm at the default -O0, which aims for consistent debugging and doesn't try at all to optimize. It's not much of a challenge to beat it when it has its hands tied behind its back.


    Why your asm can get hoisted out of the timed regions

    Without being asm volatile, your asm statement is a pure function of the inputs you've told the compiler about, which are a pointer, a length, and the initial value of sum=0. It does not include the pointed-to memory because you didn't use a dummy "m" input for that. (How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)

    Without a "memory" clobber, your asm statement isn't ordered wrt. function calls, so GCC is hoisting the asm statement out of the loop. See How does Google's `DoNotOptimize()` function enforce statement ordering for more details about that effect of the "memory" clobber.

    Have a look at the compiler output on https://godbolt.org/z/KeEMfoMvo and see how it inlined into main. -O2 and higher enables -finline-functions, while -O1 only enables -finline-functions-called-once; and since this function isn't static or inline, GCC still has to emit a stand-alone definition in case of calls from other compilation units.

    75ns is just the timing overhead of std::chrono functions around a nearly-empty timed region. It is actually running, just not inside the timed regions. You can see this if you single-step the asm of your whole program, or for example set a breakpoint on the asm statement. When doing asm-level debugging of the executable, you could help yourself find it by putting a funky instruction like mov $0xdeadbeef, %eax before xor %eax,%eax, something you can search for in the debugger's disassembly output (like GDB's layout asm or layout reg; see asm debugging tips at the bottom of https://stackoverflow.com/tags/x86/info). And yes, you do often want to look at what the compiler did when debugging inline asm, how it filled in your constraints, because stepping on its toes is a very real possibility.

    Note that a "memory" clobber without asm volatile would still let GCC do Common Subexpression Elimination (CSE) between two invocations of the asm statement, if there was no function call in between. Like if you put a repeat loop inside a timed region to test performance on an array small enough to fit in some level of cache.

    Sanity-checking your benchmark

    Is this a normal reading

    It's wild that you even have to ask that. Summing 99999999 8-byte integers in 75 ns would be a memory bandwidth of 99999999 * 8 B / 75 ns ≈ 10,666,666 GB/s, while fast dual-channel DDR4 might hit 32 GB/s. (Or cache bandwidth, if the cache were large enough to hold the array, but it's not, so your code bottlenecks on memory.)

    Or a 4 GHz CPU would have had to run 99999999 / (75 ns * 4 GHz) = 99999999 / 300 cycles ≈ 333,333 add instructions per clock cycle, but the pipeline is only 4 to 6 uops wide on modern CPUs, with a taken-branch throughput of at best 1/clock for a loop branch. (https://uops.info/ and https://agner.org/optimize/)

    Even with AVX-512, that's at best 2/clock 8x uint64_t additions per core, i.e. 16 elements per clock, nowhere near 333,333. And compilers don't rewrite your inline asm anyway; that would defeat its purpose compared to using plain C++ or intrinsics.

    This is pretty obviously just std::chrono timing overhead from a near-empty timed region.


    Asm code review: correctness

    As mentioned above, see How can I indicate that the memory *pointed* to by an inline ASM argument may be used?

    You were also missing an & early-clobber declaration on the "+r"(sum) output (i.e. "+&r"), without which the compiler could in theory pick the same register for sum as for one of the inputs. But since sum is also an input, it could only do that if numbers or length were also 0.

    It's kind of a toss-up whether it's better to xor-zero inside the asm for an "=&r" output, or better to use "+&r" and leave that zeroing to the compiler. For your loop counter, zeroing it inside the asm makes sense because the compiler doesn't need to know about it at all. But by manually picking RAX for it (with a clobber), you're preventing the compiler from choosing to have your code produce sum in RAX, like it would want for a non-inline function. A dummy [idx] "=&r" (dummy) output operand will get the compiler to pick a register for you, of the appropriate width (declare the dummy as e.g. intptr_t).
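
    Putting that together, a sketch of the same loop letting the compiler pick the counter register (dummy "=&r" output) and describing the array memory with the dummy "m" input from above:

    // For experimenting alongside the question's sum_asm (same includes as the question's code).
    uint64_t sum_asm(const uint64_t *numbers, size_t length) {
        uint64_t sum = 0;
        uint64_t idx;                           // only used as a scratch register
        asm(
            "xorl %k[idx], %k[idx]\n\t"         // 32-bit xor zero-extends (see below)
            "%=:\n\t"
            "addq (%[numbers], %[idx], 8), %[sum]\n\t"
            "incq %[idx]\n\t"
            "cmpq %[idx], %[length]\n\t"
            "jne %=b"
            : [sum]"+&r"(sum), [idx]"=&r"(idx)  // early-clobber: idx must not alias the inputs
            : [numbers]"r"(numbers), [length]"r"(length),
              "m"(*(const uint64_t (*)[]) numbers)
            : "cc"
        );
        return sum;
    }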


    Asm code review: performance

    As David Wohlferd said: use xor %eax, %eax to zero RAX. The implicit zero-extension of the 32-bit write saves a REX prefix (1 byte of code size, and smaller machine code is generally better).

    It doesn't seem worth hand-writing asm if you're not going to do anything smarter than what GCC would do on its own without -ftree-vectorize, or with -mgeneral-regs-only or -mno-sse2 (SSE2 is baseline for x86-64, but kernel code generally needs to avoid SIMD registers). But I guess it works as a learning exercise in how inline asm constraints work, and as a starting point for measuring. And to get a benchmark working so you can then test better loops.

    Typical x86-64 CPUs can do 2 loads per clock cycle (Intel since Sandy Bridge, AMD since K8), or 3/clock on Alder Lake. On modern CPUs with AVX/AVX2, each load can be 32 bytes wide (or 64 bytes with AVX-512), best case on L1d hits. Or more like 1/clock with only L2 hits on recent Intel, which is a reasonable cache-blocking target.

    But your loop can at best run 1x 8-byte load per clock cycle, because loop branches can run 1/clock, and add mem, %[sum] has a 1 cycle loop-carried dependency through sum.

    That might max out DRAM bandwidth (with the help of HW prefetchers), e.g. 8 B/cycle * 4 GHz = 32 GB/s, which modern desktop/laptop Intel CPUs can manage for a single core (but not big Xeons). But with fast enough DRAM and/or a CPU that's slow relative to it, even DRAM can avoid being the bottleneck; the 1 load per clock would be. Either way, aiming for DRAM bandwidth is quite a low bar compared to L3 or L2 cache bandwidth.

    So even if you want to keep using scalar code without movdqu / paddq (or better get to an alignment boundary for memory-source paddq, if you want to spend some code-size to optimize this loop), you could still unroll with two register accumulators for sum which you add at the end. This exposes some instruction-level parallelism, allowing two memory-source loads per clock cycle.
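
    In plain C++ the two-accumulator idea looks like the sketch below (the compiler may auto-vectorize this unless you restrict it, but the structure is what carries over to a scalar asm loop):

    #include <cstdint>
    #include <cstddef>

    // Two accumulators break the single loop-carried dependency through sum,
    // so two memory-source adds can be in flight per clock.
    uint64_t sum_two_acc(const uint64_t *numbers, size_t length) {
        uint64_t sum0 = 0, sum1 = 0;
        size_t i = 0;
        for (; i + 1 < length; i += 2) {
            sum0 += numbers[i];
            sum1 += numbers[i + 1];
        }
        for (; i < length; ++i)      // tail: at most one leftover element
            sum0 += numbers[i];
        return sum0 + sum1;
    }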


    You can also avoid the cmp, which can reduce loop overhead. Fewer uops lets out-of-order exec see farther.

    Get a pointer to the end of the array and index from -length up towards zero, like (arr+len)[idx] with for(idx=-len ; idx != 0 ; idx++). (The alternative of looping backwards through the array is on some CPUs a little worse for some of the HW prefetchers, so it's generally not recommended for loops that are often memory bound.)
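
    A sketch of that indexing pattern in C++ (sum_negidx is just an illustrative name; in asm, the idx increment itself sets flags, so the loop branch needs no separate cmp):

    #include <cstdint>
    #include <cstddef>

    uint64_t sum_negidx(const uint64_t *numbers, size_t length) {
        const uint64_t *end = numbers + length;   // pointer to one past the end
        uint64_t sum = 0;
        // Index counts from -length up toward zero; accesses still go forward through the array.
        for (ptrdiff_t idx = -static_cast<ptrdiff_t>(length); idx != 0; ++idx)
            sum += end[idx];
        return sum;
    }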

    See also Micro fusion and addressing modes - an indexed addressing mode can only stay micro-fused in the back-end on Intel Haswell and later, and only for instructions like add that RMW their destination register.

    So your best bet would be a loop with one pointer increment and 2 to 4 add instructions using it, and a cmp/jne at the bottom.
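
    A sketch of that structure, unrolled by 2 and using the dummy memory operand from earlier (assumes length is a non-zero multiple of 2; a real version needs a scalar tail for the remainder):

    // For experimenting alongside the question's sum_asm (same includes as the question's code).
    uint64_t sum_asm_unrolled(const uint64_t *numbers, size_t length) {
        uint64_t sum0 = 0, sum1 = 0;
        const uint64_t *end = numbers + length;
        asm(
            "%=:\n\t"
            "addq   (%[ptr]), %[sum0]\n\t"   // two memory-source adds, non-indexed addressing
            "addq  8(%[ptr]), %[sum1]\n\t"
            "addq  $16, %[ptr]\n\t"          // one pointer increment per iteration
            "cmpq  %[end], %[ptr]\n\t"       // cmp/jne at the bottom
            "jne %=b"
            : [sum0]"+&r"(sum0), [sum1]"+&r"(sum1), [ptr]"+&r"(numbers)
            : [end]"r"(end),
              "m"(*(const uint64_t (*)[]) numbers)
            : "cc"
        );
        return sum0 + sum1;
    }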