c performance assembly x86-64 inline-assembly

Execution time of C code that repeatedly runs the same inline assembly increases exponentially

The following C code should simply execute the same assembly code p times. The assembly, in turn, should only decrement the rcx register from 16 down to 0, in sixteen loop iterations.

When p is small, the program completes quickly, but when p is high (say p = 16), its execution time increases exponentially.

#include <stdio.h>
#include <stdlib.h>

int main() {
    int p = 16;
    int i;
    for(i=0; i<p; i++) { 
        int c = 16;
        __asm__(
            "mov %[c], %%rcx \n"
            "loop: \n" 
                "sub $1, %%rcx \n"
                "jnz loop \n"
            : 
            : [c]"m" (c)
            : "rcx"
        );
    }
    return 0;
}

Strangely enough, when adding some lines to measure the execution time, the program completes as fast as expected, without any exponential increase effect:

#include <stdio.h>
#include <stdlib.h>
#include <time.h> //added

int main() {
    int p = 16;
    int i;
    clock_t start, end; //added
    start = clock(); //added
    for(i=0; i<p; i++) { 
        int c = 16;
        __asm__(
            "mov %[c], %%rcx \n"
            "loop: \n" 
                "sub $1, %%rcx \n"
                "jnz loop \n"
            : 
            : [c]"m" (c)
            : "rcx"
        );
    }
    end = clock(); //added
    float time = (float)(end - start)/CLOCKS_PER_SEC; //added
    printf("Time spent: %f\n", time); //added
    return 0;
}

How can I avoid this issue?

Solution

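You have mov %[c], %%rcx, which is an 8-byte load, but c is only a 4-byte int. The load therefore also reads the four bytes that follow c in memory. If those bytes happen to be nonzero, RCX starts at 2^32 or more and your asm loop will execute billions of iterations instead of just 16. This is presumably also why adding the timing code changed the behaviour: it changes the stack layout, and with it whatever sits next to c.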
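
To see the effect without any assembly, here is a small sketch in plain C. The struct and the name eight_byte_view are made up for illustration; neighbor stands in for whatever happens to follow c on the stack, and the result assumes a 4-byte int and a little-endian target such as x86-64.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the value an 8-byte load at the address of c would
   produce if the 4 bytes right after c held `neighbor`.
   Assumes sizeof(int) == 4 and little-endian byte order. */
uint64_t eight_byte_view(int c, int neighbor) {
    struct { int c; int neighbor; } slot = { c, neighbor };
    uint64_t wide;
    memcpy(&wide, &slot, sizeof wide);  /* the same 8 bytes a 64-bit mov reads */
    return wide;
}
```

With these assumptions, eight_byte_view(16, 0) is 16, while eight_byte_view(16, 1) is 16 + 2^32 = 4294967312, which is the loop count the asm would actually get.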

Change c to long int (or int64_t for portability to systems where long isn't 64-bit), or use mov %[c], %%ecx to zero-extend into RCX, or movsxd %[c], %%rcx (movslq in traditional AT&T mnemonics) to sign-extend.
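
As a rough sketch of the zero-extending variant, assuming GCC or Clang on x86-64: the function name loop_iterations and the extra iteration counter are my additions, there only so the result can be checked from C.

```c
#include <stdint.h>

/* Hypothetical test helper: runs the countdown loop and returns how many
   times the loop body executed. */
uint64_t loop_iterations(int c) {
    uint64_t iters = 0;    /* incremented once per pass through the loop */
    uint64_t counter;      /* final RCX value, unused by the caller */
    __asm__ volatile(
        "movl %[c], %%ecx \n"      /* 4-byte load; writing ECX zeroes the top half of RCX */
        "0: \n"
            "add $1, %[iters] \n"
            "sub $1, %%rcx \n"     /* sets ZF when RCX reaches 0 */
            "jnz 0b \n"
        : [iters] "+r" (iters), "=c" (counter)
        : [c] "m" (c));
    return iters;
}
```

loop_iterations(16) should return 16 no matter what follows c in memory. As in the original loop, c must be at least 1; with c == 0 the counter wraps around and the loop runs for an extremely long time.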

Actually, there's no particular need to load rcx from memory yourself; let the compiler do it for you by creating an input/output operand with the c constraint. Starting an asm template with mov is inefficient, because the compiler could usually have put the value in the right register to begin with.

        unsigned long c = 16;
        __asm__ volatile(
            "0: \n" 
                "sub $1, %%rcx \n"
                "jnz 0b \n"
            : "+c" (c));  // "c" forces the compiler to pick RCX

Note that volatile is needed now: the asm has an output operand which isn't used afterward, so the compiler might otherwise optimize away the whole block. (Your original code would have had the same problem, except that asm statements with no output operands at all are implicitly volatile. I tend not to like relying on this exception, as it can be hard to remember exactly when it applies, and easy to accidentally change the code so that it no longer does. Use volatile whenever deleting the asm block would be unacceptable.)

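I've also used a local label (0:, referenced as 0b for "backward") so that the code will assemble properly in case the compiler decides to unroll the loop. Unrolling duplicates the asm block, and a named label like loop: would then be defined twice.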

Instead of hardcoding %rcx, you could use a "+r" constraint, and use dec %[c] in the loop to let the compiler choose your count register. With int c it would have picked EAX or ECX, not RCX.
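
A minimal sketch of that last variant, again assuming GCC or Clang on x86-64. The wrapper function countdown_any_reg is hypothetical and exists only to make the snippet self-contained; the "cc" clobber is redundant on x86, where flags are always treated as clobbered, but it documents the intent.

```c
#include <stdint.h>

/* Hypothetical wrapper: counts c down to zero in whatever 64-bit register
   the compiler picks, and returns the final value of the counter. */
uint64_t countdown_any_reg(uint64_t c) {
    __asm__ volatile(
        "0: \n"
            "dec %[c] \n"      /* operand size is inferred from the 64-bit register */
            "jnz 0b \n"
        : [c] "+r" (c)         /* in/out operand in any general-purpose register */
        :
        : "cc");
    return c;
}
```

countdown_any_reg(16) returns 0 after sixteen iterations. Because c is a 64-bit type here, the compiler picks a 64-bit register such as RAX or RCX.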