Search code examples
c++assemblyoptimizationx86-64micro-optimization

What is the compiler doing here that allows comparison of many values to be done with few actual comparisons?


My question is about what the compiler is doing in this case that optimizes the code way more than what I would think is possible.

Given this enum:

enum MyEnum {
    Entry1,
    Entry2,
    ...   // Entry3..27 are the same, omitted for size.
    Entry28,
    Entry29
};

And this function:

bool MyFunction(MyEnum e)
{
    if (
    e == MyEnum::Entry1 || 
    e == MyEnum::Entry3 || 
    e == MyEnum::Entry8 || 
    e == MyEnum::Entry14 || 
    e == MyEnum::Entry15 ||
    e == MyEnum::Entry18 ||
    e == MyEnum::Entry21 || 
    e == MyEnum::Entry22 ||
    e == MyEnum::Entry25)
    {
        return true;
    }
    return false;

}

For the function, MSVC generates this assembly when compiled with -Ox optimization flag (Godbolt):

bool MyFunction(MyEnum) PROC                  ; MyFunction
        cmp     ecx, 24
        ja      SHORT $LN5@MyFunction
        mov     eax, 20078725                   ; 01326085H
        bt      eax, ecx
        jae     SHORT $LN5@MyFunction
        mov     al, 1
        ret     0
$LN5@MyFunction:
        xor     al, al
        ret     0

Clang generates similar (slightly better, one less jump) assembly when compiled with -O3 flag:

MyFunction(MyEnum):                  # @MyFunction(MyEnum)
        cmp     edi, 24
        ja      .LBB0_2
        mov     eax, 20078725
        mov     ecx, edi
        shr     eax, cl
        and     al, 1
        ret
.LBB0_2:
        xor     eax, eax
        ret

What is happening here? I see that even if I add more enum comparisons to the function, the assembly that is generated does not actually become "more", it's only this magic number (20078725) that changes. That number depends on how many enum comparisons are happening in the function. I do not understand what is happening here.

The reason why I am looking at this is that I was wondering if it is good to write the function as above, or alternatively like this, with bitwise comparisons:

bool MyFunction2(MyEnum e)
{
    if (
    e == MyEnum::Entry1 | 
    e == MyEnum::Entry3 |
    e == MyEnum::Entry8 |
    e == MyEnum::Entry14 |
    e == MyEnum::Entry15 |
    e == MyEnum::Entry18 |
    e == MyEnum::Entry21 |
    e == MyEnum::Entry22 |
    e == MyEnum::Entry25)
    {
        return true;
    }
    return false;

}

This results in this generated assembly with MSVC:

bool MyFunction2(MyEnum) PROC           ; MyFunction2
        xor     edx, edx
        mov     r9d, 1
        cmp     ecx, 24
        mov     eax, edx
        mov     r8d, edx
        sete    r8b
        cmp     ecx, 21
        sete    al
        or      r8d, eax
        mov     eax, edx
        cmp     ecx, 20
        cmove   r8d, r9d
        cmp     ecx, 17
        sete    al
        or      r8d, eax
        mov     eax, edx
        cmp     ecx, 14
        cmove   r8d, r9d
        cmp     ecx, 13
        sete    al
        or      r8d, eax
        cmp     ecx, 7
        cmove   r8d, r9d
        cmp     ecx, 2
        sete    dl
        or      r8d, edx
        test    ecx, ecx
        cmove   r8d, r9d
        test    r8d, r8d
        setne   al
        ret     0

Since I do not understand what happens in the first case, I can not really judge which one is more efficient in my case.


Solution

  • Quite smart! The first comparison with 24 is to do a rough range check - if it's more than 24 or less than 0 it will bail out; this is important as the instructions that follow that operate on the magic number have a hard cap to [0, 31] for operand range.

    For the rest, the magic number is just a bitmask, with the bits corresponding to the "good" values set.

    >>> bin(20078725)
    '0b1001100100110000010000101'
    

    It's easy to spot the first and third bits (counting from 1 and from right) set, the 8th, 14th, 15th, ...

    MSVC checks it "directly" using the BT (bit test) instruction and branching, clang instead shifts it of the appropriate amount (to get the relevant bit in the lowest order position) and keeps just it ANDing it with zero (avoiding a branch).

    The C code corresponding to the clang version would be something like:

    bool MyFunction(MyEnum e) {
        if(unsigned(e) > 24) return false;
        return (20078725 >> e) & 1;
    }
    

    as for the MSVC version, it's more like

    inline bool bit_test(unsigned val, int bit) {
        return val & (1<<bit);
    }
    
    bool MyFunction(MyEnum e) {
        if(unsigned(e) > 24) return false;
        return bit_test(20078725, e);
    }
    

    (I kept the bit_test function separated to emphasize that it's actually a single instruction in assembly, that val & (1<<bit) thing has no correspondence to the original assembly.


    As for the if-based code, it's quite bad - it uses a lot of CMOV and ORs the results together, which is both longer code, and will probably serialize execution. I suspect the corresponding clang code will be better. OTOH, you wrote this code using bitwise OR (|) instead of the more semantically correct logical OR (||), and the compiler is strictly following your orders (typical of MSVC).

    Another possibility to try instead could be a switch - but I don't think there's much to gain compared to the code already generated for the first snippet, which looks pretty good to me.


    Ok, doing a quick test with all the versions against all compilers, we can see that:

    • the C translation of the CLang output above results in pretty much that same code (= to the clang output) in all compilers; similarly for the MSVC translation;
    • the bitwise or version is the same as the logical or version (= good) in both CLang and gcc;
    • in general, gcc does essentially the same thing as CLang except for the switch case;
    • switch results are varied:
      • CLang does best, by generating the exact same code;
      • both gcc and MSVC generate jump-table based code, which in this case is less good; however:
        • gcc prefers to emit a table of QWORDs, trading size for simplicity of the setup code;
        • MSVC instead emits a table of BYTEs, paying it in setup code size; I couldn't get gcc to emit similar code even changing -O3 to -Os (optimize for size).