I have a counter in an ISR (which is triggered by an external IRQ every 50us). The counter increments and wraps around at MAX_VAL (240).
I have the following code:
if(condition){
counter++;
counter %= MAX_VAL;
doStuff(table[counter]);
}
I am considering an alternative implementation:
if(condition){
//counter++;//probably I would increment before the comparison in production code
if(++counter >= MAX_VAL){
counter=0;
}
doStuff(table[counter]);
}
I know people recommend against trying to optimize like this, but it made me wonder. On x86, which is faster? What value of MAX_VAL would justify the second implementation?
This gets called about every 50us, so reducing the instruction count is not a bad idea. The if(++counter >= MAX_VAL) would be predicted false, so it would skip the assignment to 0 in the vast majority of cases. For my purposes I'd prefer the consistency of the %= implementation.
As @RossRidge says, the overhead will mostly be lost in the noise of servicing an interrupt on a modern x86 (probably at least 100 clock cycles, and many many more if this is part of a modern OS with Meltdown + Spectre mitigation set up).
If MAX_VAL is a power of 2, counter %= MAX_VAL is excellent, especially if counter is unsigned (in which case it's just a simple and, or in this case a movzx byte to dword, which can have zero latency on Intel CPUs. It still has a throughput cost of course: Can x86's MOV really be "free"? Why can't I reproduce this at all?)
Is it possible to pad the table out to 256 entries and fill the last 256-240 of them with something harmless, or repeats of something?
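For instance, if the table could be padded out to 256 entries, a uint8_t counter wraps for free and the modulo disappears entirely. Here's a minimal sketch, assuming a hypothetical padded table256[] and the same doStuff() handler:

#include <stdint.h>

extern int table256[256];   // hypothetical: the 240 real entries plus 16 harmless pad entries
void doStuff(int);

void isr_countup_pow2(int condition) {
    static uint8_t counter = 0;        // wraps 255 -> 0 automatically
    if (condition) {
        ++counter;                     // no mask or compare needed
        doStuff(table256[counter]);    // the byte counter just gets movzx-loaded as an index
    }
}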
As long as MAX_VAL is a compile-time constant, though, counter %= MAX_VAL will compile efficiently to just a couple of multiplies, shifts, and adds. (Again, more efficient for unsigned.) Why does GCC use multiplication by a strange number in implementing integer division?
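As a rough illustration of that multiply-and-shift idea for MAX_VAL = 240: since 240 = 16*15, one option is to shift out the factor of 16 and then divide by 15 with a fixed-point inverse (floor(2^32/15)+1 = 286331154). This is only a sketch of the technique, not necessarily the exact sequence gcc emits:

#include <stdint.h>
#include <assert.h>

// x % 240 without a div instruction: q = x/240 via a multiply-high, then r = x - q*240
static uint32_t mod240(uint32_t x) {
    uint32_t q = (uint32_t)(((uint64_t)(x >> 4) * 286331154u) >> 32);   // x / 240
    return x - q * 240u;                                                // x % 240
}

int main(void) {
    for (uint64_t x = 0; x < (1ULL << 32); x += 65537)   // spot-check a spread of 32-bit values
        assert(mod240((uint32_t)x) == (uint32_t)x % 240u);
    return 0;
}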
But a check for wrap-around is even better. A branchless check (using cmov) has lower latency than the remainder using a multiplicative inverse, and costs fewer uops for throughput. As you say, a branchy check can take the check off the critical path entirely, at the cost of a mispredict sometimes.
// simple version that works exactly like your question
// further optimizations assume that counter isn't used by other code in the function,
// e.g. making it a pointer or incrementing it for the next iteration
#define MAX_VAL 240           // wrap limit from the question
extern int table[MAX_VAL];    // assumed: table of int entries, defined elsewhere
void doStuff(int);            // assumed handler prototype

void isr_countup(int condition) {
static unsigned int counter = 0;
if(condition){
++counter;
counter = (counter>=MAX_VAL) ? 0 : counter; // gcc uses cmov
//if(counter >= MAX_VAL) counter = 0; // gcc branches
doStuff(table[counter]);
}
}
I compiled many versions of this on the Godbolt compiler explorer, with recent gcc and clang.
(For more about static performance analysis of throughput and latency for short blocks of x86 asm, see What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?, and other links in the x86 tag wiki, especially Agner Fog's guides.)
clang uses branchless cmov for both versions. I compiled with -fPIE in case you're using that in your kernels. If you can use -fno-pie, then the compiler can save an LEA and use mov edi, [table + 4*rcx], assuming you're on a target where static addresses in position-dependent code fit in 32-bit sign-extended constants (e.g. true in the Linux kernel, but I'm not sure whether they compile with -fPIE or do kernel ASLR with relocations when the kernel is loaded).
# clang7.0 -O3 -march=haswell -fPIE.
# gcc's output is the same (with different registers), but it uses `mov edx, 0` instead of xor-zeroing for no reason: the mov is also placed before the cmp that sets flags, so xor-zeroing would have been safe
isr_countup: # @isr_countup
test edi, edi
je .LBB1_1 # if condition is false
mov eax, dword ptr [rip + isr_countup.counter]
add eax, 1 # counter++
xor ecx, ecx
cmp eax, 239 # set flags based on counter - (MAX_VAL-1)
cmovbe ecx, eax # ecx = (counter <= MAX_VAL-1) ? counter : 0
mov dword ptr [rip + isr_countup.counter], ecx # store the new counter
lea rax, [rip + table]
mov edi, dword ptr [rax + 4*rcx] # index the table
jmp doStuff@PLT # TAILCALL
.LBB1_1:
ret
The block of 8 instructions starting at the load of the old counter value is a total of 8 uops (on AMD, or Intel Broadwell and later, where cmov is only 1 uop). The critical-path latency from counter being ready to table[++counter % MAX_VAL] being ready is 1 (add) + 1 (cmp) + 1 (cmov) + load-use latency for the load, i.e. 3 extra cycles. That's the latency of 1 mul instruction. Or 1 extra cycle on older Intel where cmov is 2 uops.
By comparison, the version using modulo is 14 uops for that block with gcc, including a 3-uop mul r32. The latency is at least 8 cycles; I didn't count exactly. (For throughput it's only a little bit worse, though, unless the higher latency reduces the ability of out-of-order execution to overlap execution of stuff that depends on the counter.)
Other optimizations
Use the old value of counter, and prepare a value for next time (taking the calculation off the critical path).
Use a pointer instead of a counter. Saves a couple of instructions, at the cost of using 8 bytes instead of 1 or 4 bytes of cache footprint for the variable. (A uint8_t counter compiles nicely with some versions, just using movzx to 64-bit.)
This counts upward, so the table can be in order. It increments after loading, taking that logic off the critical path dependency chain for out-of-order execution.
void isr_pointer_inc_after(int condition) {
static int *position = table;
if(condition){
int tmp = *position;
position++;
position = (position >= table + MAX_VAL) ? table : position;
doStuff(tmp);
}
}
This compiles really nicely with both gcc and clang, especially if you're using -fPIE so the compiler needs the table address in a register anyway.
# gcc8.2 -O3 -march=haswell -fPIE
isr_pointer_inc_after(int):
test edi, edi
je .L29
mov rax, QWORD PTR isr_pointer_inc_after(int)::position[rip]
lea rdx, table[rip+960] # table+MAX_VAL
mov edi, DWORD PTR [rax] # tmp = *position, the arg for doStuff
add rax, 4
cmp rax, rdx
lea rdx, -960[rdx] # table, calculated relative to table+MAX_VAL
cmovnb rax, rdx
mov QWORD PTR isr_pointer_inc_after(int)::position[rip], rax
jmp doStuff(int)@PLT
.L29:
ret
Again, 8 uops (assuming cmov is 1 uop). This has even lower latency than the counter version possibly could, because the [rax] addressing mode (with RAX coming from a load) has 1 cycle lower latency than an indexed addressing mode on Sandybridge-family. With no displacement, it never suffers the penalty described in Is there a penalty when base+offset is in a different page than the base?
Or (with a counter) count down towards zero: potentially saves an instruction if the compiler can use the flags set by the decrement to detect the value becoming negative. Or we can always use the decremented value, and do the wrap-around after that: so when counter is 1, we'd use table[--counter] (table[0]), but store counter=MAX_VAL. Again, this takes the wrap check off the critical path.
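A sketch of that "use the decremented value, wrap afterwards" idea in C, in the same setting as the other versions here (this variant is hypothetical and wasn't among the compiled versions):

// Counts downward; uses the freshly decremented value, then wraps afterwards
void isr_dec_before(int condition) {
    static unsigned int counter = MAX_VAL;            // one past the last valid index
    if (condition) {
        --counter;                                    // index used this time
        int tmp = table[counter];
        counter = (counter == 0) ? MAX_VAL : counter; // after using table[0], store MAX_VAL for next time
        doStuff(tmp);
    }
}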
If you wanted a branch version, you'd want it to branch on the carry flag, because sub eax,1 / jc can macro-fuse into 1 uop, but sub eax,1 / js can't macro-fuse on Sandybridge-family: x86_64 - Assembly - loop conditions and out of order. But with branchless code it's fine: cmovs (mov if the sign flag is set, i.e. if the last result was negative) is just as efficient as cmovc (mov if the carry flag is set).
It was tricky to get gcc to use the flag results from the dec or sub without also doing a cdqe to sign-extend the index to pointer width. I guess I could use an intptr_t counter, but that would be silly; might as well just use a pointer. With an unsigned counter, gcc and clang both want to do another cmp eax, 239 or something after the decrement, even though the flags are already set just fine from the decrement. But we can get gcc to use SF by checking (int)counter < 0:
// Counts downward, table[] entries need to be reversed
void isr_branchless_dec_after(int condition) {
static unsigned int counter = MAX_VAL-1;
if(condition){
int tmp = table[counter];
--counter;
counter = ((int)counter < 0) ? MAX_VAL-1 : counter;
//counter = (counter >= MAX_VAL) ? MAX_VAL-1 : counter;
//counter = (counter==0) ? MAX_VAL-1 : counter-1;
doStuff(tmp);
}
}
# gcc8.2 -O3 -march=haswell -fPIE
isr_branchless_dec_after(int):
test edi, edi
je .L20
mov ecx, DWORD PTR isr_branchless_dec_after(int)::counter[rip]
lea rdx, table[rip]
mov rax, rcx # stupid compiler, this copy is unneeded
mov edi, DWORD PTR [rdx+rcx*4] # load the arg for doStuff
mov edx, 239 # calculate the next counter value
dec eax
cmovs eax, edx
mov DWORD PTR isr_branchless_dec_after(int)::counter[rip], eax # and store it
jmp doStuff(int)@PLT
.L20:
ret
Still 8 uops (it should be 7), but there's no extra latency on the critical path. So all of the extra decrement and wrap instructions are juicy instruction-level parallelism for out-of-order execution.