c++ assembly visual-c++ compiler-optimization

Very verbose ASM code of MSVC /Os vs GCC -O2 for simple template code

I was looking at some examples using std::visit, and I wanted to explore the following common example a bit:

#include <iostream>
#include <variant>

struct Fluid { };
struct LightItem { };
struct HeavyItem { };
struct FragileItem { };

template<class... Ts> struct overload : Ts... { using Ts::operator()...; };
template<class... Ts> overload(Ts...) -> overload<Ts...>; // line not needed in C++20...

int main() {
    std::variant<Fluid, LightItem, HeavyItem, FragileItem> package(HeavyItem{});

    std::visit(overload{
        [](Fluid& )       { std::cout << "fluid\n"; },
        [](LightItem& )   { std::cout << "light item\n"; },
        [](HeavyItem& )   { std::cout << "heavy item\n"; },
        [](FragileItem& ) { std::cout << "fragile\n"; }
    }, package);
}

I compiled the code with both GCC and MSVC, and noticed that with MSVC the amount of generated asm is an order of magnitude greater than with GCC.

Here is the code compiled with GCC.

Here is the code compiled with MSVC.

Is there a way to find out why there is such a big difference? Can MSVC be made to optimize this so that the asm is similar to GCC's?

Solution

MSVC /Os alone doesn't enable any(?) optimization; it just changes the tuning that applies if you do enable optimization. Code-gen is still like a debug build. Apparently it needs to be combined with other options to be usable. It's not like GCC -Os; for that, use MSVC -O1.

If you look at the asm source instead of the binary disassembly, it's easier to see that MSVC's main calls a constructor, std::variant<...>::variant<...>, zeros some memory, then calls std::visit. GCC, on the other hand, has obviously inlined all of that down to just a single cout<< call.

MSVC also inlines and constant-propagates through std::visit if you tell it to fully optimize, with -O2 or -O1 instead of /Os. https://godbolt.org/z/5MdcYh9xn has a main that's about the same as GCC's, just calling cout's operator<< with the address of a constant.
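To illustrate why an optimizer can collapse the whole std::visit call, here is a minimal sketch, assuming C++17 (/std:c++17 with MSVC). The describe() helper and the Package alias are hypothetical names that are not in the question, and the lambdas return the label instead of printing it so the result can be checked at compile time. Because the alternative held by the variant is known, the dispatch folds to a constant:

```cpp
#include <string_view>
#include <variant>

struct Fluid { };
struct LightItem { };
struct HeavyItem { };
struct FragileItem { };

template<class... Ts> struct overload : Ts... { using Ts::operator()...; };
template<class... Ts> overload(Ts...) -> overload<Ts...>;

// Hypothetical alias, introduced only for this sketch.
using Package = std::variant<Fluid, LightItem, HeavyItem, FragileItem>;

// Hypothetical helper: same visitor shape as in the question, but each
// lambda returns its label instead of writing to std::cout.
constexpr std::string_view describe(const Package& package) {
    return std::visit(overload{
        [](const Fluid&)       -> std::string_view { return "fluid"; },
        [](const LightItem&)   -> std::string_view { return "light item"; },
        [](const HeavyItem&)   -> std::string_view { return "heavy item"; },
        [](const FragileItem&) -> std::string_view { return "fragile"; }
    }, package);
}

// The active alternative is known here, so the visit is a constant expression.
static_assert(describe(Package{HeavyItem{}}) == "heavy item");
```

This only shows that the dispatch is foldable in principle; whether a given compiler actually does so in a normal (non-constexpr) build still depends on enabling optimization, as described above.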

MSVC's docs don't make it clear which options actually enable (some/any) optimization vs. just biasing the choices if some other option enables some optimization.

/O1 sets a combination of optimizations that generate minimum size code.
/O2 sets a combination of optimizations that optimizes code for maximum speed.
...
/Os tells the compiler to favor optimizations for size over optimizations for speed.
/Ot (a default setting) tells the compiler to favor optimizations for speed over optimizations for size.

[But note that optimization in general is off by default, and this being the default doesn't change that. So /Os and /Ot don't seem to enable optimization at all.]
/Ox is a combination option that selects several of the optimizations with an emphasis on speed. /Ox is a strict subset of the /O2 optimizations.

If I hadn't tested, I'd have assumed from that doc that -Os would enable at least some optimizations. (MSVC accepts both - and / as the start of an option name; I wrote - in most of this answer because that's what Unix/Linux use and I know that MSVC accepts it.)

(MSVC always prints a ton of stuff in its asm source output, including stand-alone definitions for template functions that got inlined. I assume that's why you were using compile-to-binary to see what actually ended up in the linked executable. For some reason with a /O1 build on Godbolt, it can run but won't show disassembly: Cannot open compiler generated file [...]\output.s.obj. Or no, it's just intermittently broken for me, even with your original link.)

Simpler example

For example, this bar() becomes very simple after inlining, but MSVC /Os doesn't do that even for this trivial function. In fact, code-gen is identical with no options, the default debug mode.

int foo(int a,int b){ return a+b*5;}
int bar(int x){
    return foo(3*x, 2*x);
}

; MSVC 19.32 /Os
a$ = 8
b$ = 16
int foo(int,int) PROC                                  ; foo
        mov     DWORD PTR [rsp+16], edx
        mov     DWORD PTR [rsp+8], ecx
        imul    eax, DWORD PTR b$[rsp], 5
        mov     ecx, DWORD PTR a$[rsp]
        add     ecx, eax
        mov     eax, ecx
        ret     0
int foo(int,int) ENDP                                  ; foo

x$ = 48
int bar(int) PROC                                 ; bar
$LN3:
        mov     DWORD PTR [rsp+8], ecx
        sub     rsp, 40                             ; 00000028H
        mov     eax, DWORD PTR x$[rsp]
        shl     eax, 1
        imul    ecx, DWORD PTR x$[rsp], 3
        mov     edx, eax
        call    int foo(int,int)                     ; foo
        add     rsp, 40                             ; 00000028H
        ret     0
int bar(int) ENDP                                 ; bar

Not just lack of inlining; note the spill of x and two reloads when computing x*2 and x*3. Same for foo, spilling its args and reloading, like a debug build. At first I thought it wasn't fully a debug build due to not using RBP as a frame pointer, but MSVC generates identical asm with no options.

Compare that with a usable optimization level, MSVC -O1, where code-gen is very similar to GCC -O2 or -Os:

; MSVC 19.32 -O1
x$ = 8
int bar(int) PROC                                 ; bar, COMDAT
        imul    eax, ecx, 13
        ret     0
int bar(int) ENDP                                 ; bar

a$ = 8
b$ = 16
int foo(int,int) PROC                                  ; foo, COMDAT
        lea     eax, DWORD PTR [rcx+rdx*4]
        add     eax, edx
        ret     0
int foo(int,int) ENDP                                  ; foo
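The constant 13 in the optimized bar follows from the arithmetic: foo(3*x, 2*x) = 3*x + (2*x)*5 = 13*x. As a small sketch, the same functions can be marked constexpr (an addition for this check, not part of the original example) so that the identity is verified at compile time. foo_lea is a hypothetical name for a function that mirrors the lea/add pair in the -O1 output of foo:

```cpp
// Same bodies as the example above; constexpr is added only so the
// results can be checked with static_assert.
constexpr int foo(int a, int b) { return a + b * 5; }
constexpr int bar(int x) { return foo(3 * x, 2 * x); }

// Hypothetical helper mirroring the -O1 code for foo:
//   lea eax, [rcx+rdx*4]   ; a + b*4
//   add eax, edx           ; ... + b
constexpr int foo_lea(int a, int b) { return (a + b * 4) + b; }

static_assert(bar(1) == 13);                 // 3 + 2*5
static_assert(bar(7) == 13 * 7);             // matches imul eax, ecx, 13
static_assert(foo_lea(3, 2) == foo(3, 2));   // both are 13
```

The checks only cover small inputs; for values where the multiplication would overflow int, the C++ source has undefined behavior, which is what allows the compiler to use a single imul.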