Can someone tell me whether 'immediate' addressing mode is more efficient than addressing through [eax] or any other way?
Let's say I have a long function with some reads and some writes (say, five reads and five writes) to some int value in memory:
mov some_addr, 12 // immediate addressing
//other code
mov eax, some_addr
//other code
mov some_addr, eax // and so on
versus
mov eax, some_addr
mov [eax], 12 // addressing thru [eax]
//other code
mov ebx, [eax]
//other code
mov [eax], ebx // and so on
Which one is faster?
The register-indirect access is probably slightly faster, and it is certainly shorter in its encoding, for example (warning: GAS/AT&T syntax):
67 89 18 mov %ebx, (%eax)
67 8b 18 mov (%eax), %ebx
vs.
89 1c 25 00 00 00 00 mov %ebx, some_addr
8b 1c 25 00 00 00 00 mov some_addr, %ebx
So the shorter encoding has some implications for instruction fetch, cache use, etc., so it is probably a bit faster, but in a long function with only some reads and writes I don't think it is of much importance...
(The zeros in the hex code are to be filled in by the linker, just to have said this.)
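For completeness, the immediate store the question asks about compares similarly (again GAS syntax, 64-bit-mode encodings; some_addr taken from the question):

b8 00 00 00 00                    mov $some_addr, %eax
67 c7 00 0c 00 00 00              movl $12, (%eax)
vs.
c7 04 25 00 00 00 00 0c 00 00 00  movl $12, some_addr

The 5-byte pointer load into %eax is paid once; after that, every store through (%eax) is 7 bytes against 11 bytes for the absolute-address form.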
[update date="2012-09-30 ~21h30 CEST":
I have run some tests, and I really wonder at what they revealed; so much so that I didn't investigate further :-)
48 8D 04 25 00 00 00 00 leaq s_var,%rax
C7 00 CC BA ED FE movl $0xfeedbacc,(%rax)
performs in most runs better than
C7 04 25 00 00 00 00 CC BA ED FE  movl $0xfeedbacc,s_var
I'm really surprised, and now I'm wondering how Maratyszcza would explain this. I have an idea already, but, what the fun, seeing these (example) results
movl to s_var
All 000000000E95F890 244709520
Avg 00000000000000E9 233
Min 00000000000000C8 200
Max 0000000000276B22 2583330
leaq s_var, movl to (reg)
All 000000000BF77C80 200768640
Avg 00000000000000BF 191
Min 00000000000000AA 170
Max 00000000001755C0 1529280
might well support his statement that the instruction decoder takes in a maximum of 8 bytes per cycle, but it doesn't show how many bytes are really decoded.
In the leaq/movl pair, each instruction is (incl. operands) less than 8 bytes long, so it is likely that each instruction is dispatched within one cycle, while the single movl has to be divided into two. Still, I'm convinced that it is not the decoder slowing things down: even with the 11-byte movl, its work is done after the third byte; then it just has to wait for the pipeline to stream in the address and the immediate, both of which need no decoding.
Since this is 64-bit-mode code, I also tested with the one-byte-shorter RIP-relative addressing, with (almost) the same result.
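For reference, the RIP-relative form mentioned here encodes like this (the displacement again to be filled in by the linker):

c7 05 00 00 00 00 cc ba ed fe  movl $0xfeedbacc, s_var(%rip)

That is 10 bytes, one shorter than the 11-byte SIB-based absolute form above, since the ModRM byte 05 selects disp32(%rip) directly and no SIB byte is needed.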
Note: These measurements might heavily depend on the (micro-)architecture they are run on. The values above are from running the testing code on an Atom N450 (constant TSC, [email protected], fixed at 1.0GHz during the test run), which is unlikely to be representative of the whole x86(-64) platform.
Note: The measurements are taken at runtime, with no accounting for intervening task/context switches or other interrupts!
/update]