I am trying to write a small library for highly optimised x86-64 bit operation code and am fiddling with inline asm.
While testing this particular case has caught my attention:
unsigned long test = 0;
unsigned long bsr;
// bit test and set 39th bit
__asm__ ("btsq\t%1, %0 " : "+rm" (test) : "rJ" (39) );
// bit scan reverse (get most significant bit id)
__asm__ ("bsrq\t%1, %0" : "=r" (bsr) : "rm" (test) );
printf("test = %lu, bsr = %d\n", test, bsr);
compiles and runs fine in both gcc and icc, but when I inspect the assembly I get differences
gcc -S -fverbose-asm -std=gnu99 -O3
movq $0, -8(%rbp)
## InlineAsm Start
btsq $39, -8(%rbp)
## InlineAsm End
movq -8(%rbp), %rax
movq %rax, -16(%rbp)
## InlineAsm Start
bsrq -16(%rbp), %rdx
## InlineAsm End
movq -8(%rbp), %rsi
leaq L_.str(%rip), %rdi
xorb %al, %al
callq _printf
I am wondering why so complicated? I am writing high performance code in which the number of instructions is critical. I am especially wondering why gcc makes a copy of my variable test
before passing it to the second inline asm?
Same code compiled with icc gives far better results:
xorl %esi, %esi # test = 0
movl $.L_2__STRING.0, %edi # has something to do with printf
orl $32832, (%rsp) # part of function initiation
xorl %eax, %eax # has something to do with printf
ldmxcsr (%rsp) # part of function initiation
btsq $39, %rsi #106.0
bsrq %rsi, %rdx #109.0
call printf #111.2
despite the fact that gcc decides to keep my variables on the stack rather then in registers, what I do not understand is why make a copy of test
before passing it to the second asm?
If I put test
in as an input/output variable in the second asm
__asm__ ("bsrq\t%1, %0" : "=r" (bsr) , "+rm" (test) );
then those lines disappear.
movq $0, -8(%rbp)
## InlineAsm Start
btsq $39, -8(%rbp)
## InlineAsm End
## InlineAsm Start
bsrq -8(%rbp), %rdx
## InlineAsm End
movq -8(%rbp), %rsi
leaq L_.str(%rip), %rdi
xorb %al, %al
callq _printf
Is this gcc screwed up optimisation or am I missing some vital compiler switches? I do have icc for my production system, but if I decide to distribute the source code at some point then it will have to be able to compile with gcc too.
compilers used:
gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.1.00)
icc Version 12.0.2
I've tried your example on Linux like this (making it "evil" by forcing a stack ref/loc for test
through using &test
in the printf
:):
#include <stdio.h>
int main(int argc, char **argv)
{
unsigned long test = 0;
unsigned long bsr;
// bit test and set 39th bit
asm ("btsq\t%1, %0 " : "+rm" (test) : "rJ" (39) );
// bit scan reverse (get most significant bit id)
asm ("bsrq\t%1, %0" : "=r" (bsr) : "rm" (test) );
printf("test = %lu, bsr = %d, &test = %p\n", test, bsr, &test);
return 0;
}
and compiled it with various versions of gcc -O3
... to the following results:
code generated gcc version ================================================================================ 400630: 48 83 ec 18 sub $0x18,%rsp 4.7.2, 400634: 31 c0 xor %eax,%eax 4.6.2, 400636: bf 50 07 40 00 mov $0x400750,%edi 4.4.6 40063b: 48 8d 4c 24 08 lea 0x8(%rsp),%rcx 400640: 48 0f ba e8 27 bts $0x27,%rax 400645: 48 89 44 24 08 mov %rax,0x8(%rsp) 40064a: 48 89 c6 mov %rax,%rsi 40064d: 48 0f bd d0 bsr %rax,%rdx 400651: 31 c0 xor %eax,%eax 400653: e8 68 fe ff ff callq 4004c0 [ ... ] --------------------------------------------------------------------------------- 4004f0: 48 83 ec 18 sub $0x18,%rsp 4.1 4004f4: 31 c0 xor %eax,%eax 4004f6: bf 28 06 40 00 mov $0x400628,%edi 4004fb: 48 8d 4c 24 10 lea 0x10(%rsp),%rcx 400500: 48 c7 44 24 10 00 00 00 00 movq $0x0,0x10(%rsp) 400509: 48 0f ba e8 27 bts $0x27,%rax 40050e: 48 89 44 24 10 mov %rax,0x10(%rsp) 400513: 48 89 c6 mov %rax,%rsi 400516: 48 0f bd d0 bsr %rax,%rdx 40051a: 31 c0 xor %eax,%eax 40051c: e8 c7 fe ff ff callq 4003e8 [ ... ] --------------------------------------------------------------------------------- 400500: 48 83 ec 08 sub $0x8,%rsp 3.4.5 400504: bf 30 06 40 00 mov $0x400630,%edi 400509: 31 c0 xor %eax,%eax 40050b: 48 c7 04 24 00 00 00 00 movq $0x0,(%rsp) 400513: 48 89 e1 mov %rsp,%rcx 400516: 48 0f ba 2c 24 27 btsq $0x27,(%rsp) 40051c: 48 8b 34 24 mov (%rsp),%rsi 400520: 48 0f bd 14 24 bsr (%rsp),%rdx 400525: e8 fe fe ff ff callq 400428 [ ... ] --------------------------------------------------------------------------------- 4004e0: 48 83 ec 08 sub $0x8,%rsp 3.2.3 4004e4: bf 10 06 40 00 mov $0x400610,%edi 4004e9: 31 c0 xor %eax,%eax 4004eb: 48 c7 04 24 00 00 00 00 movq $0x0,(%rsp) 4004f3: 48 0f ba 2c 24 27 btsq $0x27,(%rsp) 4004f9: 48 8b 34 24 mov (%rsp),%rsi 4004fd: 48 89 e1 mov %rsp,%rcx 400500: 48 0f bd 14 24 bsr (%rsp),%rdx 400505: e8 ee fe ff ff callq 4003f8 [ ... ]
and while there's a significant difference in the created code (including whether the bsr
acceesses test
as register or memory), none of the tested revs recreate the assembly that you've shown. I'd suspect a bug in the 4.2.x version you used on MacOSX, but then I don't have either your testcase nor that specific compiler version available.
Edit: The code above is obviously different in the sense that it forces test
into the stack; if that is not done, then all "plain" gcc versions I've tested do a direct pair bts $39, %rsi
/ bsr %rsi, %rdx
.
I have found, though, that clang
creates different code there:
140: 50 push %rax 141: 48 c7 04 24 00 00 00 00 movq $0x0,(%rsp) 149: 31 f6 xor %esi,%esi 14b: 48 0f ba ee 27 bts $0x27,%rsi 150: 48 89 34 24 mov %rsi,(%rsp) 154: 48 0f bd d6 bsr %rsi,%rdx 158: bf 00 00 00 00 mov $0x0,%edi 15d: 30 c0 xor %al,%al 15f: e8 00 00 00 00 callq printf@plt>so the difference seems to be indeed between the code generators of clang/llvm and "gcc proper".