Can someone tell me whether 'immediate' addressing mode is more efficient than addressing through [eax] or any other way?
Let's say I have a long function with some reads and some writes (say, five reads and five writes) to some int value in memory:
mov some_addr, 12 // immediate addressing
//other code
mov eax, some_addr
//other code
mov some_addr, eax // and so on
versus
mov eax, some_addr
mov [eax], 12 // addressing thru [eax]
//other code
mov ebx, [eax]
//other code
mov [eax], ebx // and so on
Which one is faster?
The register-indirect access is probably slightly faster, and it is certainly shorter in its encoding, for example (warning: GAS/AT&T syntax):
67 89 18 mov %ebx, (%eax)
67 8b 18 mov (%eax), %ebx
vs.
89 1c 25 00 00 00 00 mov %ebx, some_addr
8b 1c 25 00 00 00 00 mov some_addr, %ebx
So the shorter encoding has some implications for instruction fetch, cache use, etc., so it is probably a bit faster, but in a long function with only some reads and writes I don't think it is of much importance...
(The zeros in the hex code are to be filled in by the linker, just to have said this.)
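For completeness, the immediate store the question asks about compares similarly (again GAS syntax, 64-bit-mode encodings; some_addr taken from the question):

b8 00 00 00 00                    mov $some_addr, %eax
67 c7 00 0c 00 00 00              movl $12, (%eax)
vs.
c7 04 25 00 00 00 00 0c 00 00 00  movl $12, some_addr

The 5-byte pointer load into %eax is paid once; after that, every store through (%eax) is 7 bytes against 11 bytes for the absolute-address form.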
[update date="2012-09-30 ~21h30 CEST":
I have run some tests, and I really wonder at what they revealed; so much so that I didn't investigate further :-)
48 8D 04 25 00 00 00 00 leaq s_var,%rax
C7 00 CC BA ED FE movl $0xfeedbacc,(%rax)
performs in most runs better than
C7 04 25 00 00 00 00 CC BA ED FE  movl $0xfeedbacc,s_var
I'm really surprised, and now I'm wondering how Maratyszcza would explain this. I have an idea already, but, what the fun, seeing these (example) results
movl to s_var
All 000000000E95F890 244709520
Avg 00000000000000E9 233
Min 00000000000000C8 200
Max 0000000000276B22 2583330
leaq s_var, movl to (reg)
All 000000000BF77C80 200768640
Avg 00000000000000BF 191
Min 00000000000000AA 170
Max 00000000001755C0 1529280
might well support his statement that the instruction decoder takes in a maximum of 8 bytes per cycle, but it doesn't show how many bytes are really decoded.
In the leaq/movl pair, each instruction is (incl. operands) less than 8 bytes long, so it is likely that each instruction is dispatched within one cycle, while the single movl has to be divided into two. Still, I'm convinced that it is not the decoder slowing things down: even with the 11-byte movl, its work is done after the third byte; then it just has to wait for the pipeline to stream in the address and the immediate, both of which need no decoding.
Since this is 64-bit-mode code, I also tested with the one-byte-shorter RIP-relative addressing, with (almost) the same result.
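For reference, the RIP-relative form mentioned here encodes like this (the displacement again to be filled in by the linker):

c7 05 00 00 00 00 cc ba ed fe  movl $0xfeedbacc, s_var(%rip)

That is 10 bytes, one shorter than the 11-byte SIB-based absolute form above, since the ModRM byte 05 selects disp32(%rip) directly and no SIB byte is needed.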
Note: These measurements might heavily depend on the (micro-)architecture they are run on. The values above are from running the testing code on an Atom N450 (constant TSC, [email protected], fixed at 1.0GHz during the test run), which is unlikely to be representative of the whole x86(-64) platform.
Note: The measurements are taken at runtime, with no accounting for intervening task/context switches or other interrupts!
/update]