Search code examples
assemblyx86micro-optimization

Multiplication with constant - imul or shl-add-combination


This question is about how we multiply an integer with a constant. So let's look at a simple function:

int f(int x) {
    return 10*x;
}

How can that function be optimized best, especially when inlined into a caller?

Approach 1 (produced by most optimizing compilers (e.g. on Godbolt))

    lea    (%rdi,%rdi,4), %eax
    add    %eax, %eax

Approach 2 (produced with clang3.6 and earlier, with -O3)

    imul   $10, %edi, %eax

Approach 3 (produced with g++6.2 without optimization, removing stores/reloads)

    mov    %edi, %eax
    sal    $2, %eax
    add    %edi, %eax
    add    %eax, %eax

Which version is fastest, and why? Primarily interested in Intel Haswell.


Solution

  • I just did some measurements. We mimic the following code in assembly, using the instructions given in the question:

        for (unsigned i = 0; i < (1 << 30); ++i) {
            r1 = r2 * 10;
            r2 = r1 * 10;
        }
    

    Possibly there are some additional registers needed for temporaries.

    Note: All measurements are in cycles per one multiplication.

    We use clang compiler with -O3, because there we exactly get the assembly we want (gcc sometimes adds very few MOVs inside the loop). We have two parameters: u = #unrolled loops, and i = #ilp. For example for u=4, i=2, we get the following:

      401d5b:   0f a2                   cpuid  
      401d5d:   0f 31                   rdtsc  
      401d5f:   89 d6                   mov    %edx,%esi
      401d61:   89 c7                   mov    %eax,%edi
      401d63:   41 89 f0                mov    %esi,%r8d
      401d66:   89 f8                   mov    %edi,%eax
      401d68:   b9 00 00 20 00          mov    $0x200000,%ecx
      401d6d:   0f 1f 00                nopl   (%rax)
      401d70:   6b d5 0a                imul   $0xa,%ebp,%edx
      401d73:   41 6b f7 0a             imul   $0xa,%r15d,%esi
      401d77:   6b fa 0a                imul   $0xa,%edx,%edi
      401d7a:   6b d6 0a                imul   $0xa,%esi,%edx
      401d7d:   6b f7 0a                imul   $0xa,%edi,%esi
      401d80:   6b fa 0a                imul   $0xa,%edx,%edi
      401d83:   6b d6 0a                imul   $0xa,%esi,%edx
      401d86:   6b f7 0a                imul   $0xa,%edi,%esi
      401d89:   6b fa 0a                imul   $0xa,%edx,%edi
      401d8c:   6b d6 0a                imul   $0xa,%esi,%edx
      401d8f:   6b f7 0a                imul   $0xa,%edi,%esi
      401d92:   6b fa 0a                imul   $0xa,%edx,%edi
      401d95:   44 6b e6 0a             imul   $0xa,%esi,%r12d
      401d99:   44 6b ef 0a             imul   $0xa,%edi,%r13d
      401d9d:   41 6b ec 0a             imul   $0xa,%r12d,%ebp
      401da1:   45 6b fd 0a             imul   $0xa,%r13d,%r15d
      401da5:   ff c9                   dec    %ecx
      401da7:   75 c7                   jne    401d70 <_Z7measureIN5timer5rtdscE2V1Li16777216ELi4ELi2EEvv+0x130>
      401da9:   49 c1 e0 20             shl    $0x20,%r8
      401dad:   49 09 c0                or     %rax,%r8
      401db0:   0f 01 f9                rdtscp 
    

    Note that these are not 8, but 16 imul instructions, because this is r2 = r1 * 10; r1 = r2 * 10; I will only post the main loop, i.e.,

      401d70:   6b d5 0a                imul   $0xa,%ebp,%edx
      401d73:   41 6b f7 0a             imul   $0xa,%r15d,%esi
      401d77:   6b fa 0a                imul   $0xa,%edx,%edi
      401d7a:   6b d6 0a                imul   $0xa,%esi,%edx
      401d7d:   6b f7 0a                imul   $0xa,%edi,%esi
      401d80:   6b fa 0a                imul   $0xa,%edx,%edi
      401d83:   6b d6 0a                imul   $0xa,%esi,%edx
      401d86:   6b f7 0a                imul   $0xa,%edi,%esi
      401d89:   6b fa 0a                imul   $0xa,%edx,%edi
      401d8c:   6b d6 0a                imul   $0xa,%esi,%edx
      401d8f:   6b f7 0a                imul   $0xa,%edi,%esi
      401d92:   6b fa 0a                imul   $0xa,%edx,%edi
      401d95:   44 6b e6 0a             imul   $0xa,%esi,%r12d
      401d99:   44 6b ef 0a             imul   $0xa,%edi,%r13d
      401d9d:   41 6b ec 0a             imul   $0xa,%r12d,%ebp
      401da1:   45 6b fd 0a             imul   $0xa,%r13d,%r15d
      401da5:   ff c9                   dec    %ecx
      401da7:   75 c7                   jne    401d70 <_Z7measureIN5timer5rtdscE2V1Li16777216ELi4ELi2EEvv+0x130>
    

    Instead of rtdsc we use perf (PERF_COUNT_HW_REF_CPU_CYCLES = "ref cycles" and PERF_COUNT_HW_CPU_CYCLES = "cpu cycles").

    We fix u = 16, and vary i = {1, 2, 4, 8, 16, 32}. We get

    name    uroll ilp   ref cyclescpu cyclesp0      p1      p2      p3      p4      p5      p6      p7      
    V1      16    1        2.723   3.006   0.000   1.000   0.000   0.000   0.000   0.000   0.031   0.000
    V1      16    2        1.349   1.502   0.000   1.000   0.000   0.000   0.000   0.000   0.016   0.000
    V1      16    4        0.902   1.002   0.000   1.000   0.000   0.000   0.000   0.000   0.008   0.000
    V1      16    8        0.899   1.001   0.000   1.000   0.004   0.006   0.008   0.002   0.006   0.002
    V1      16    16       0.898   1.001   0.000   1.000   0.193   0.218   0.279   0.001   0.003   0.134
    V1      16    32       0.926   1.008   0.000   1.004   0.453   0.490   0.642   0.001   0.002   0.322
    

    This makes sense. The ref cycles can be ignored.

    The columns on the right show the number of microops on the execution ports. We have one instruction on p1 (the imul, of course) and on p6 we have the decrement (1/16 in the first case). Later we can also see that we get other microops due to strong register pressure.

    Ok, let's move to the second version, which is now:

      402270:   89 ea                   mov    %ebp,%edx
      402272:   c1 e2 02                shl    $0x2,%edx
      402275:   01 ea                   add    %ebp,%edx
      402277:   01 d2                   add    %edx,%edx
      402279:   44 89 fe                mov    %r15d,%esi
      40227c:   c1 e6 02                shl    $0x2,%esi
      40227f:   44 01 fe                add    %r15d,%esi
      402282:   01 f6                   add    %esi,%esi
      402284:   89 d7                   mov    %edx,%edi
      402286:   c1 e7 02                shl    $0x2,%edi
      402289:   01 d7                   add    %edx,%edi
      40228b:   01 ff                   add    %edi,%edi
      40228d:   89 f2                   mov    %esi,%edx
      40228f:   c1 e2 02                shl    $0x2,%edx
      402292:   01 f2                   add    %esi,%edx
      402294:   01 d2                   add    %edx,%edx
      402296:   89 fe                   mov    %edi,%esi
      402298:   c1 e6 02                shl    $0x2,%esi
      40229b:   01 fe                   add    %edi,%esi
      40229d:   01 f6                   add    %esi,%esi
      40229f:   89 d7                   mov    %edx,%edi
      4022a1:   c1 e7 02                shl    $0x2,%edi
      4022a4:   01 d7                   add    %edx,%edi
      4022a6:   01 ff                   add    %edi,%edi
      4022a8:   89 f2                   mov    %esi,%edx
      4022aa:   c1 e2 02                shl    $0x2,%edx
      4022ad:   01 f2                   add    %esi,%edx
      4022af:   01 d2                   add    %edx,%edx
      4022b1:   89 fe                   mov    %edi,%esi
      4022b3:   c1 e6 02                shl    $0x2,%esi
      4022b6:   01 fe                   add    %edi,%esi
      4022b8:   01 f6                   add    %esi,%esi
      4022ba:   89 d7                   mov    %edx,%edi
      4022bc:   c1 e7 02                shl    $0x2,%edi
      4022bf:   01 d7                   add    %edx,%edi
      4022c1:   01 ff                   add    %edi,%edi
      4022c3:   89 f2                   mov    %esi,%edx
      4022c5:   c1 e2 02                shl    $0x2,%edx
      4022c8:   01 f2                   add    %esi,%edx
      4022ca:   01 d2                   add    %edx,%edx
      4022cc:   89 fe                   mov    %edi,%esi
      4022ce:   c1 e6 02                shl    $0x2,%esi
      4022d1:   01 fe                   add    %edi,%esi
      4022d3:   01 f6                   add    %esi,%esi
      4022d5:   89 d7                   mov    %edx,%edi
      4022d7:   c1 e7 02                shl    $0x2,%edi
      4022da:   01 d7                   add    %edx,%edi
      4022dc:   01 ff                   add    %edi,%edi
      4022de:   89 f2                   mov    %esi,%edx
      4022e0:   c1 e2 02                shl    $0x2,%edx
      4022e3:   01 f2                   add    %esi,%edx
      4022e5:   01 d2                   add    %edx,%edx
      4022e7:   89 fe                   mov    %edi,%esi
      4022e9:   c1 e6 02                shl    $0x2,%esi
      4022ec:   01 fe                   add    %edi,%esi
      4022ee:   01 f6                   add    %esi,%esi
      4022f0:   89 d5                   mov    %edx,%ebp
      4022f2:   c1 e5 02                shl    $0x2,%ebp
      4022f5:   01 d5                   add    %edx,%ebp
      4022f7:   01 ed                   add    %ebp,%ebp
      4022f9:   41 89 f7                mov    %esi,%r15d
      4022fc:   41 c1 e7 02             shl    $0x2,%r15d
      402300:   41 01 f7                add    %esi,%r15d
      402303:   45 01 ff                add    %r15d,%r15d
      402306:   ff c8                   dec    %eax
      402308:   0f 85 62 ff ff ff       jne    402270 <_Z7measureIN5timer5rtdscE2V2Li16777216ELi4ELi2EEvv+0xe0>
    

    Measurements for V2

    name    uroll ilp   ref cyclescpu cyclesp0      p1      p2      p3      p4      p5      p6      p7      
    V2      16    1        2.696   3.004   0.776   0.714   0.000   0.000   0.000   0.731   0.811   0.000
    V2      16    2        1.454   1.620   0.791   0.706   0.000   0.000   0.000   0.719   0.800   0.000
    V2      16    4        0.918   1.022   0.836   0.679   0.000   0.000   0.000   0.696   0.795   0.000
    V2      16    8        0.914   1.018   0.864   0.647   0.006   0.002   0.012   0.671   0.826   0.008
    V2      16    16       1.277   1.414   0.834   0.652   0.237   0.263   0.334   0.685   0.887   0.161
    V2      16    32       1.572   1.751   0.962   0.703   0.532   0.560   0.671   0.740   1.003   0.230
    

    This also makes sense, we are slower, and with i=32, we for sure have too large register pressure (shown by the other ports used and in the assembly)... With i=0, we can verify, that p0+p1+p5+p6=~3.001, so no ILP of course. We could expect 4 cpu cycles, but the MOV is for free (register allocation).

    With i=4: SHL is executed on p0 or p6, the ADD and MOV are both executed on p0, 1, 5, or 6. So we have 1 op on p0 or p6, and 2+1 ops (add/mov) on p0, p1, p5, or p6. Again, the MOV is for free, so we get 1 cycle per multiplication.

    And finally with the optimized version:

      402730:   67 8d 7c ad 00          lea    0x0(%ebp,%ebp,4),%edi
      402735:   01 ff                   add    %edi,%edi
      402737:   67 43 8d 2c bf          lea    (%r15d,%r15d,4),%ebp
      40273c:   01 ed                   add    %ebp,%ebp
      40273e:   67 8d 1c bf             lea    (%edi,%edi,4),%ebx
      402742:   01 db                   add    %ebx,%ebx
      402744:   67 8d 7c ad 00          lea    0x0(%ebp,%ebp,4),%edi
      402749:   01 ff                   add    %edi,%edi
      40274b:   67 8d 2c 9b             lea    (%ebx,%ebx,4),%ebp
      40274f:   01 ed                   add    %ebp,%ebp
      402751:   67 8d 1c bf             lea    (%edi,%edi,4),%ebx
      402755:   01 db                   add    %ebx,%ebx
      402757:   67 8d 7c ad 00          lea    0x0(%ebp,%ebp,4),%edi
      40275c:   01 ff                   add    %edi,%edi
      40275e:   67 8d 2c 9b             lea    (%ebx,%ebx,4),%ebp
      402762:   01 ed                   add    %ebp,%ebp
      402764:   67 8d 1c bf             lea    (%edi,%edi,4),%ebx
      402768:   01 db                   add    %ebx,%ebx
      40276a:   67 8d 7c ad 00          lea    0x0(%ebp,%ebp,4),%edi
      40276f:   01 ff                   add    %edi,%edi
      402771:   67 8d 2c 9b             lea    (%ebx,%ebx,4),%ebp
      402775:   01 ed                   add    %ebp,%ebp
      402777:   67 8d 1c bf             lea    (%edi,%edi,4),%ebx
      40277b:   01 db                   add    %ebx,%ebx
      40277d:   67 44 8d 64 ad 00       lea    0x0(%ebp,%ebp,4),%r12d
      402783:   45 01 e4                add    %r12d,%r12d
      402786:   67 44 8d 2c 9b          lea    (%ebx,%ebx,4),%r13d
      40278b:   45 01 ed                add    %r13d,%r13d
      40278e:   67 43 8d 2c a4          lea    (%r12d,%r12d,4),%ebp
      402793:   01 ed                   add    %ebp,%ebp
      402795:   67 47 8d 7c ad 00       lea    0x0(%r13d,%r13d,4),%r15d
      40279b:   45 01 ff                add    %r15d,%r15d
      40279e:   ff c9                   dec    %ecx
      4027a0:   75 8e                   jne    402730 <_Z7measureIN5timer5rtdscE2V3Li16777216ELi4ELi2EEvv+0x170>
    

    Measurements for V3

    name    uroll ilp   ref cyclescpu cyclesp0      p1      p2      p3      p4      p5      p6      p7      
    V3      16    1        1.797   2.002   0.447   0.558   0.000   0.000   0.000   0.557   0.469   0.000
    V3      16    2        1.273   1.418   0.448   0.587   0.000   0.000   0.000   0.528   0.453   0.000
    V3      16    4        0.774   0.835   0.449   0.569   0.000   0.000   0.000   0.548   0.442   0.000
    V3      16    8        0.572   0.636   0.440   0.555   0.017   0.021   0.032   0.562   0.456   0.012
    V3      16    16       0.753   0.838   0.433   0.630   0.305   0.324   0.399   0.644   0.458   0.165
    V3      16    32       0.976   1.087   0.467   0.766   0.514   0.536   0.701   0.737   0.458   0.333
    

    Okay, now we are faster than the imul. 2 cycles for i=0 (1 for both instructions), and for i=8, we are even faster:. The lea can be executed on p1 and p5, the add, as above, on p0, p1, p5, or p6. So if perfectly scheduled, the LEA goes to p1 and p5, the ADD to p0, or p6. Unfortunately, in this case it isn't that perfect (the assembly is fine). I suppose that scheduling is not perfect, and a few ADD land on the p1/p5 ports.

    All code can be seen on gcc.godbolt.org (it has quite some template magic, but boils down to extremely simple code, which has been checked many times). Note that I removed the timers and some other stuff, which is not necessary for checking the assembly.