Tags: optimization, assembly, x86, cpu, addressing-mode

Addressing mode efficiency


Can someone tell me if 'immediate' addressing mode is more efficient than addressing through [eax] or any other way?

Let's say that I have a long function with some reads and some writes (say five reads, five writes) to some int value in memory:

     mov some_addr, 12      // immediate addressing
     //other code
     mov eax, some_addr
     //other code
     mov some_addr, eax    // and so on

versus

     mov eax, some_addr

     mov dword ptr [eax], 12      // addressing through [eax]
     //other code
     mov ebx, [eax]
     //other code
     mov [eax], ebx    // and so on

Which one is faster?


Solution

  • Probably the register-indirect access is slightly faster, but for sure it is shorter in its encoding. For example (warning: GAS/AT&T syntax):

    67 89 18                   mov %ebx, (%eax)
    67 8b 18                   mov (%eax), %ebx
    

    vs.

    89 1c 25 00 00 00 00       mov %ebx, some_addr
    8b 1c 25 00 00 00 00       mov some_addr, %ebx
    

    So it has some implications for instruction fetch, cache use, etc., which make it probably a bit faster; but in a long function with only some reads and writes, I don't think it is of much importance...

    (The zeros in the hex code are filled in by the linker, just to have said this.)
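    The length difference can be checked directly. Here is a small sketch of my own (not from the answer, and assuming GCC or Clang on x86-64 ELF) that brackets each form of the store with assembler labels and subtracts their addresses; the direct form uses the RIP-relative encoding, which is the same length as the SIB-absolute one shown above minus the SIB byte plus the shorter ModRM, i.e. 6 bytes:

    ```c
    /* Sketch (assumes GCC/Clang, x86-64 ELF): wrap each store form in
       labels and subtract their addresses to get its encoded length. */
    #include <stdio.h>

    int s_var;                      /* stand-in for "some_addr" */
    extern const char a0[], a1[], b0[], b1[];

    void never_called(void)
    {
        /* direct (RIP-relative) store: opcode + ModRM + disp32 = 6 bytes */
        __asm__ volatile("a0: movl %%ebx, s_var(%%rip)\na1:" ::: "memory");
        /* register-indirect store: opcode + ModRM = 2 bytes */
        __asm__ volatile("b0: movl %%ebx, (%%rax)\nb1:" ::: "memory");
    }

    int main(void)
    {
        printf("direct store:   %td bytes\n", a1 - a0);  /* prints 6 */
        printf("indirect store: %td bytes\n", b1 - b0);  /* prints 2 */
        return 0;
    }
    ```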

    [update date="2012-09-30 ~21h30 CEST":

    I have run some tests, and I was really surprised by what they revealed; so much so that I didn't investigate further :-)

    48 8D 04 25 00 00 00 00    leaq s_var,%rax
    C7 00 CC BA ED FE          movl $0xfeedbacc,(%rax)
    

    performs in most runs better than

    C7 04 25 00 00 00 00 CC    movl $0xfeedbacc,s_var
    BA ED FE
    

    I'm really surprised, and now I'm wondering how Maratyszcza would explain this. I have an idea already, but no matter; these (example) results

    movl to s_var
    All 000000000E95F890 244709520
    Avg 00000000000000E9 233
    Min 00000000000000C8 200
    Max 0000000000276B22 2583330
    leaq s_var, movl to (reg)
    All 000000000BF77C80 200768640
    Avg 00000000000000BF 191
    Min 00000000000000AA 170
    Max 00000000001755C0 1529280
    

    might well support his statement that the instruction decoder takes a maximum of 8 bytes per cycle, but they don't show how many bytes are really decoded.

    In the leaq/movl pair, each instruction is (including operands) less than 8 bytes, so it is likely that each is dispatched within one cycle, while the single movl has to be split across two. Still, I'm convinced that it is not the decoder slowing things down: even with the 11-byte movl, its work is done after the third byte; after that it just has to wait for the pipeline to stream in the address and the immediate, neither of which needs decoding.

    Since this is 64-bit mode code, I also tested with the 1-byte-shorter RIP-relative addressing, with (almost) the same result.

    Note: These measurements may depend heavily on the (micro-)architecture they are run on. The values above were obtained running the test code on an Atom N450 (constant TSC, fixed at 1.0 GHz during the test run), which is unlikely to be representative of the whole x86(-64) platform.

    Note: The measurements were taken at runtime, with no filtering of effects such as task/context switches or other intervening interrupts!
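    A harness along these lines can be sketched as follows. This is a hypothetical reconstruction of mine, not the test code actually used; it times both store forms with the TSC and reports All/Avg/Min/Max, and it uses RIP-relative addressing for the direct store since a plain absolute `s_var` would not link in a default PIE build (the answer notes the result is almost the same):

    ```c
    /* Hypothetical TSC micro-benchmark (x86-64, GCC/Clang only):
       "movl $imm, s_var(%rip)" vs "leaq s_var(%rip); movl $imm, (reg)". */
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>          /* __rdtsc() */

    int s_var;
    enum { RUNS = 100000 };

    static uint64_t time_direct(void) {
        uint64_t t0 = __rdtsc();
        __asm__ volatile("movl $0xfeedbacc, s_var(%%rip)" ::: "memory");
        return __rdtsc() - t0;
    }

    static uint64_t time_indirect(void) {
        void *r;
        uint64_t t0 = __rdtsc();
        __asm__ volatile("leaq s_var(%%rip), %0\n\t"
                         "movl $0xfeedbacc, (%0)"
                         : "=r"(r) :: "memory");
        return __rdtsc() - t0;
    }

    static void report(const char *name, uint64_t (*f)(void)) {
        uint64_t all = 0, min = UINT64_MAX, max = 0;
        for (int i = 0; i < RUNS; i++) {
            uint64_t t = f();
            all += t;
            if (t < min) min = t;
            if (t > max) max = t;
        }
        printf("%s\n  All %llu  Avg %llu  Min %llu  Max %llu\n", name,
               (unsigned long long)all, (unsigned long long)(all / RUNS),
               (unsigned long long)min, (unsigned long long)max);
    }

    int main(void) {
        report("movl to s_var", time_direct);
        report("leaq s_var, movl to (reg)", time_indirect);
        return 0;
    }
    ```

    As the note above says, Min is the most meaningful of these figures, since All/Avg/Max absorb any interrupts and context switches that happen to land inside a timed run.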

    /update]