I've recently seen an article on how the swap operation can be performed using XOR instead of a temporary variable. When I compile code using a ^= b;
the result won't simply be (in AT&T syntax)
xor b, a
etc.
Instead, it loads the raw values into registers, XORs them, and writes the result back. To optimize this, I want to write it in inline assembly so that it only uses three ticks to do the entire thing and not 15 like it does normally.
I've tried multiple keywords like:
asm(...);
asm("...");
asm{...};
asm{"..."};
asm ...
__asm ...
None of that worked: I either get a syntax error (gcc doesn't seem to accept all of that syntax) or this message:
main.cpp: Assembler messages:
main.cpp:12: Error: too many memory references for `xor'
Basically, I want to use the variables defined in my C++ code inside the assembler block, use three lines to xor them, and end up with my variables swapped, like this:
int main() {
    volatile int a = 5;
    volatile int b = 6;
    asm {
        xor a,b
        xor b,a
        xor a,b
    };
    //a should now be 6, b should be 5
}
To clarify: I want to avoid using the compiler generated mov operations since they take more cpu cycles than just doing three xor operations which would take three cycles. How could I accomplish this?
To use inline assembly, you should use __asm__ volatile. However, this type of optimization may be premature. Just because there are more instructions does not mean the code is slower - some instructions can be really slow. For example, a floating point BCD store instruction (fbstp), while admittedly rare, takes over 200 cycles - compared to one cycle for a simple mov (Agner Fog's Optimization Guide is a good resource for these timings).
So, I implemented a bunch of "swap" functions, some in C++ and some in assembly, and did a bit of measuring, running each function 100 million times in a row.
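The measuring loop itself is not shown here, so the following is only a minimal sketch of how such a measurement could look, assuming std::chrono::steady_clock as the timer. The names time_swaps and example_swap are hypothetical and do not come from the benchmark above.

```cpp
#include <chrono>
#include <cstdint>
#include <utility>

// Hypothetical helper: calls swap_fn(&a, &b) `iterations` times and
// returns the elapsed wall-clock time in microseconds.
template <typename SwapFn>
std::int64_t time_swaps(SwapFn swap_fn, long iterations, int &a, int &b) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (long i = 0; i < iterations; ++i) {
        swap_fn(&a, &b);
    }
    const auto stop = clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

// Hypothetical candidate with the same signature as the functions below.
void example_swap(int *a, int *b) {
    std::swap(*a, *b);
}
```

A call such as time_swaps(example_swap, 100000000L, a, b) would then time one candidate. In a real harness you would also have to stop the optimizer from inlining the candidate and collapsing the loop (for example by compiling the candidates in a separate translation unit), otherwise the numbers mean very little.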
std::swap
std::swap is probably the preferred solution here. It does what you want (swap the values of two variables), works for most standard library types and not just for integers, clearly communicates what you are trying to achieve, and is portable across architectures.
void std_swap(int *a, int *b) {
    std::swap(*a, *b);
}
Here is the generated assembly. It loads both values into registers, and then writes them back to the opposite memory locations.
movl (%rdi), %eax
movl (%rsi), %edx
movl %edx, (%rdi)
movl %eax, (%rsi)
This is what you were trying to do, in C++:
void xor_swap(int *a, int *b) {
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}
This doesn't directly translate to only xor instructions, because there is no instruction on x86 that allows you to directly xor two locations in memory - you always need to load at least one of the two into a register:
movl (%rdi), %eax
xorl (%rsi), %eax
movl %eax, (%rdi)
xorl (%rsi), %eax
movl %eax, (%rsi)
xorl %eax, (%rdi)
The compiler also generates a bunch of extra instructions because the two pointers may alias, i.e. point to overlapping memory areas. In that case, changing one variable would also change the other, so the compiler has to store and re-load the values after every step. An implementation using the compiler-specific __restrict keyword will compile to the same code as std_swap (thanks to @Ped7g for pointing out this flaw in the comments).
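For reference, a sketch of what that variant could look like. The function name xor_swap_restrict is hypothetical, and __restrict is a non-standard extension, so this assumes GCC, Clang, or MSVC.

```cpp
// Hypothetical name. __restrict promises the compiler that a and b
// never point to overlapping memory, so it may keep both values in
// registers instead of re-loading them after every store.
void xor_swap_restrict(int *__restrict a, int *__restrict b) {
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}
```

The promise is on the caller: passing the same address for both arguments breaks it. Even without __restrict, an XOR swap of a variable with itself does not leave it unchanged but sets it to zero, because x ^ x is 0.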
This is the "standard" swap with a temporary variable (that the compiler promptly optimizes out to the same code as std::swap):
void tmp_swap(int *a, int *b) {
    int tmp = *a;
    *a = *b;
    *b = tmp;
}
xchg instruction
xchg can swap a memory value with a register value - it seems perfect at first for your use case. However, it is really slow when you use it to access memory, as you will see later.
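void xchg_asm_swap(int *a, int *b) {
    __asm__ volatile (
        "movl (%0), %%eax\n\t"    /* eax = *a */
        "xchgl (%1), %%eax\n\t"   /* swap eax with *b */
        "movl %%eax, (%0)"        /* *a = old *b */
        : "+r" (a), "+r" (b)
        : /* No separate inputs */
        : "%eax", "memory"        /* "memory": *a and *b are written */
    );
}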
We need to load one of the two values into a register, because there is no xchg for two memory locations.
I made two versions of the XOR-based swap in assembly. The first one only loads one of the values into a register, the second loads both before swapping them and writing them back.
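void xor_asm_swap(int *a, int *b) {
    __asm__ volatile (
        "movl (%0), %%eax\n\t"    /* eax = *a */
        "xorl (%1), %%eax\n\t"    /* eax = *a ^ *b */
        "xorl %%eax, (%1)\n\t"    /* *b = old *a */
        "xorl (%1), %%eax\n\t"    /* eax = old *b */
        "movl %%eax, (%0)"        /* *a = old *b */
        : "+r" (a), "+r" (b)
        : /* No separate inputs */
        : "%eax", "memory"        /* "memory": *a and *b are written */
    );
}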
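void xor_asm_register_swap(int *a, int *b) {
    __asm__ volatile (
        "movl (%0), %%eax\n\t"    /* eax = *a */
        "movl (%1), %%ecx\n\t"    /* ecx = *b */
        "xorl %%ecx, %%eax\n\t"   /* eax = *a ^ *b */
        "xorl %%eax, %%ecx\n\t"   /* ecx = old *a */
        "xorl %%ecx, %%eax\n\t"   /* eax = old *b */
        "movl %%eax, (%0)\n\t"    /* *a = old *b */
        "movl %%ecx, (%1)"        /* *b = old *a */
        : "+r" (a), "+r" (b)
        : /* No separate inputs */
        : "%eax", "%ecx", "memory" /* "memory": *a and *b are written */
    );
}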
You can view the full compilation results along with the generated assembly code on Godbolt.
On my machine, the timings (in microseconds) vary a bit, but are generally comparable:
std_swap: 127371
xor_swap: 150152
tmp_swap: 125896
xchg_asm_swap: 699355
xor_asm_swap: 130586
xor_asm_register_swap: 124718
You can see that std_swap, tmp_swap, xor_asm_swap, and xor_asm_register_swap are generally very similar in speed - in fact, if I move xor_asm_register_swap to the front, it turns out slightly slower than std_swap. Also note that tmp_swap is exactly the same assembly code as std_swap (although it regularly measures in as a bit faster, probably because of the ordering).
xor_swap implemented in C++ is slightly slower because, due to possible aliasing, the compiler generates an additional memory load/store for each of the instructions. As mentioned above, if we modify xor_swap to take int * __restrict a, int * __restrict b instead (meaning that a and b never alias), the compiler generates the same code as for std_swap and tmp_swap.
xchg_asm_swap, despite using the lowest number of instructions, is terribly slow (over four times slower than any of the other options), just because xchg is not a fast operation if it involves a memory access: with a memory operand it is implicitly locked, so it behaves like an atomic read-modify-write.
Ultimately, you have the choice between using some custom assembly-based version (that is hard to understand and maintain) or just using std::swap (which is pretty much the opposite, and also benefits from any optimizations that the standard library designers can come up with, e.g. using vectorization on larger types). Since these timings cover one hundred million iterations, it should be clear that the potential improvement by using assembly code here is very small - if you improve at all (which is not clear), you'd shave off a few milliseconds over the whole run, i.e. a few hundredths of a nanosecond per swap.
TL;DR: You shouldn't do that, just use std::swap(a, b)
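To illustrate the point that std::swap is not limited to integers, here is a small sketch. The wrapper name swap_values is hypothetical and only there to keep the example in one place; in real code you would call std::swap (or an unqualified swap) directly.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical wrapper for illustration: swaps two objects of any
// swappable type, not just integers.
template <typename T>
void swap_values(T &x, T &y) {
    using std::swap;  // lets a type-specific swap overload be picked up via ADL
    swap(x, y);
}
```

The same call works for int, std::string, or std::vector<int>. For the container types, the standard library swaps the internal buffers instead of copying the elements, which an XOR trick cannot do.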
__asm__ volatile
I figured that it may make sense at this point to explain the inline assembly code a bit. __asm__ (in GNU mode, asm is enough) introduces a block of assembly code. The volatile is there to make sure the compiler doesn't optimize it away - it likes to just remove the block otherwise.
There are two forms of __asm__ volatile. One of them also deals with goto labels; I will not address it here. The other form takes up to four arguments, separated with colons (:):
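The simplest form (__asm__ volatile ("rdtsc")) just dumps the assembly code, but does not really interact with the C++ code around it. In particular, you need to guess how variables are assigned to registers, which is not exactly good.
The instructions in the assembly string are separated with "\n", because this assembly code is passed verbatim to the GNU assembler (gas).
The second argument is a list of output operands, each with a constraint (=r means "any register operand, write-only", and +r means "any register operand, but it is also used as an input"). For example, : "+r" (a), "+r" (b) tells the compiler to replace %0 (references the first of the operands) with the register containing a, and %1 with the register containing b.
Because % now introduces an operand, you need to replace %eax (as you would normally reference eax in AT&T assembly notation) with %%eax to escape the percentage sign.
You can also use ".intel_syntax\n" to switch to Intel's assembly syntax if you prefer (switch back with ".att_syntax\n" at the end of the block, since the code the compiler emits around it is in AT&T syntax by default).
The third argument has the same format, but lists input-only operands.
The fourth argument is the clobber list: it tells the compiler which registers and memory the assembly code overwrites, so that it does not keep live values there. For example, clobbering "memory" tells the compiler that the code may read or write arbitrary memory, so it will not keep memory values cached in registers across the block or reorder memory accesses around it; this is a compiler-level barrier and does not emit a hardware fence instruction. You can see that I added all the registers I used for temporary storage to this list.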
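To tie the operand syntax together, here is a minimal sketch that is closest to what the question asked for: three xor instructions on two values that the compiler has already placed in registers. It assumes GCC or Clang on x86, and the function name xor_swap_regs is hypothetical (it is not one of the benchmarked functions). On other targets the sketch simply falls back to std::swap.

```cpp
#include <utility>

// Hypothetical helper, not part of the benchmark above.
// x and y must be two distinct objects.
void xor_swap_regs(int &x, int &y) {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile (
        "xorl %1, %0\n\t"     // x ^= y
        "xorl %0, %1\n\t"     // y ^= x  (y now holds the old x)
        "xorl %1, %0"         // x ^= y  (x now holds the old y)
        : "+r" (x), "+r" (y)  // outputs: read and written, any register
        : /* no input-only operands */
        : "cc"                // clobbers: xor modifies the flags
    );
#else
    std::swap(x, y);          // fallback on other compilers or architectures
#endif
}
```

No registers are named explicitly, so nothing but the flags needs to appear in the clobber list. Note that the compiler still emits the loads and stores needed to get x and y into registers and back, which is exactly the mov traffic the question hoped to avoid.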