The following C code should simply execute the same assembly code p times; the assembly in turn should only decrement the rcx register sixteen times, from 16 to 0.
When p is small, the program completes quickly, but when p is large (say p = 16), its execution time appears to increase exponentially.
#include <stdio.h>
#include <stdlib.h>

int main() {
    int p = 16;
    int i;
    for(i=0; i<p; i++) {
        int c = 16;
        __asm__(
            "mov %[c], %%rcx \n"
            "loop: \n"
            "sub $1, %%rcx \n"
            "jnz loop \n"
            :
            : [c]"m" (c)
            : "rcx"
        );
    }
    return 0;
}
Strangely enough, when I add a few lines to measure the execution time, the program completes as fast as expected, with no exponential slowdown:
#include <stdio.h>
#include <stdlib.h>
#include <time.h> //added

int main() {
    int p = 16;
    int i;
    clock_t start, end; //added
    start = clock(); //added
    for(i=0; i<p; i++) {
        int c = 16;
        __asm__(
            "mov %[c], %%rcx \n"
            "loop: \n"
            "sub $1, %%rcx \n"
            "jnz loop \n"
            :
            : [c]"m" (c)
            : "rcx"
        );
    }
    end = clock(); //added
    float time = (float)(end - start)/CLOCKS_PER_SEC; //added
    printf("Time spent: %f\n", time); //added
    return 0;
}
How can I avoid this issue?
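You have mov %[c], %%rcx but c is only an int. Because the destination is a 64-bit register, that mov is an 8-byte load, while c occupies only 4 bytes. If the four bytes following c in memory happen to be nonzero, your asm loop will execute many billions of iterations instead of just 16. Adding the timing code presumably changes the stack layout, so different bytes end up next to c, which would explain why the problem seems to disappear.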
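To see the effect in isolation, here is a minimal sketch in plain C that models the 8-byte load with memcpy instead of inline asm. The helper name wide_load and its neighbor parameter are made up for illustration, and the result assumes a little-endian target such as x86-64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: models what a 64-bit load that starts at a 32-bit
   counter sees when `neighbor` happens to occupy the next four bytes.
   Assumes a little-endian target such as x86-64. */
uint64_t wide_load(uint32_t c, uint32_t neighbor)
{
    uint32_t mem[2] = { c, neighbor };   /* c at offset 0, neighbor at offset 4 */
    uint64_t rcx;

    assert(sizeof mem == sizeof rcx);    /* two 32-bit words fill one 64-bit read */
    memcpy(&rcx, mem, sizeof rcx);       /* the 8-byte read a 64-bit mov performs */
    return rcx;
}
```

With neighbor equal to 0 the loaded value is 16, as intended. With neighbor equal to 1 it becomes 2^32 + 16, so the countdown loop would run more than four billion times.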
Change c to long int (or int64_t for portability to systems where long isn't 64-bit), or use mov %[c], %%ecx to zero-extend into RCX, or movsxd %[c], %%rcx to sign-extend.
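As a rough sketch of those three options, the functions below only perform the load and hand back whatever ends up in RCX, so the widths can be checked without running a loop. The function names are made up for illustration, the code assumes GCC or Clang on x86-64, and the sign-extending variant uses movslq, which is the AT&T-syntax spelling of movsxd.

```c
#include <assert.h>
#include <stdint.h>

/* Option 1: make the variable as wide as the register being loaded. */
uint64_t load_wide(int64_t c)
{
    uint64_t out;
    __asm__("mov %[c], %%rcx" : "=c"(out) : [c] "m"(c));
    return out;
}

/* Option 2: keep int, but load into ECX. Writing a 32-bit register
   zeroes the upper 32 bits of the full 64-bit register. */
uint64_t load_zero_extended(int c)
{
    uint64_t out;
    assert(sizeof c == 4);   /* the 32-bit mov reads exactly four bytes */
    __asm__("mov %[c], %%ecx" : "=c"(out) : [c] "m"(c));
    return out;
}

/* Option 3: keep int, and sign-extend it into RCX. */
int64_t load_sign_extended(int c)
{
    int64_t out;
    assert(sizeof c == 4);   /* movslq reads a 32-bit source */
    __asm__("movslq %[c], %%rcx" : "=c"(out) : [c] "m"(c));
    return out;
}
```

For a positive count such as 16 all three give the same result. They only differ for negative values: zero-extension turns -1 into 0xFFFFFFFF, while sign-extension keeps it as -1.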
Actually, there's no particular need to load rcx from memory yourself; let the compiler do it for you by creating an input/output operand with the "c" register constraint. Starting an asm template with mov is inefficient.
unsigned long c = 16;
__asm__ volatile(
    "0: \n"
    "sub $1, %%rcx \n"
    "jnz 0b \n"
    : "+c" (c)); // "c" forces the compiler to pick RCX
Note that volatile is needed now, since the asm now has an output operand which isn't used afterward, so the compiler might otherwise optimize away the whole block. (This would also have been an issue with your original code, except that there's a special exception for asm statements with no output operands at all. I tend not to like relying on this exception, as it can be hard to remember exactly when it applies, and easy to accidentally change the code so that it no longer does. Use volatile whenever deleting the asm block would be unacceptable.)
I've also used a local label so that the code will assemble properly in case the compiler decides to unroll the loop.
Instead of hardcoding %rcx, you could use a "+r" constraint and dec %[c] in the loop to let the compiler choose your count register. With int c it would have picked a 32-bit register such as EAX or ECX, not RCX.
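A minimal sketch of that "+r" variant, assuming GCC or Clang on x86-64, could look like this. The wrapper function countdown is a made-up name for illustration, and the assert is an added guard rather than part of the original loop.

```c
#include <assert.h>

/* Hypothetical wrapper: counts c down to zero in whatever register the
   compiler picks, then returns the final value of the counter. */
unsigned long countdown(unsigned long c)
{
    assert(c > 0);   /* starting at 0 would wrap around and loop 2^64 times */

    __asm__ volatile(
        "0:          \n\t"   /* numeric local label, safe if the asm is duplicated */
        "dec %[c]    \n\t"   /* operand size follows the register chosen for c */
        "jnz 0b      \n\t"   /* jump back to the nearest preceding label 0 */
        : [c] "+r"(c)        /* read-write operand in any general-purpose register */
        :
        : "cc");             /* flags are modified */

    return c;
}
```

Calling countdown(16) runs the loop sixteen times and returns 0. Because c is unsigned long here, %[c] expands to a 64-bit register name, so no separate load or extension is needed.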