Tags: c, assembly, clang, inline-assembly, gnu-assembler

Inline Assembly Causing Errors about No Prefixes


Hello,

So, I'm optimizing some functions that I wrote for a simple operating system I'm developing. This function, putpixel(), currently looks like this (in case my assembly is unclear or wrong):

uint32_t loc  = (x*pixel_w)+(y*pitch);
vidmem[loc]   = color & 255;
vidmem[loc+1] = (color >> 8) & 255;
vidmem[loc+2] = (color >> 16) & 255;

This takes a little bit of explanation. First, loc is the pixel index I want to write to in video memory. X and Y coordinates are passed to the function. Then, we multiply X by the pixel width in bytes (in this case, 3) and Y by the number of bytes in each line. More information can be found here.

vidmem is a global variable, a uint8_t pointer to video memory.

That being said, anyone familiar with bitwise operations should be able to figure out how putpixel() works fairly easily.
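
For reference, a self-contained version of the current C function (the signature and the pixel_w / pitch declarations here are assumptions added so it compiles on its own; only the body shown above is verbatim):

#include <stdint.h>

extern uint8_t *vidmem;              // pointer to the linear framebuffer
extern uint32_t pixel_w, pitch;      // bytes per pixel (3) and bytes per scanline

void putpixel(uint32_t x, uint32_t y, uint32_t color)
{
    uint32_t loc  = (x*pixel_w)+(y*pitch);  // byte offset into video memory
    vidmem[loc]   = color & 255;            // low byte
    vidmem[loc+1] = (color >> 8) & 255;     // middle byte
    vidmem[loc+2] = (color >> 16) & 255;    // high byte
}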

Now, here's my assembly. Note that it has not been tested and may even be slower or just plain not work. This question is about how to make it compile.

I've replaced everything after the definition of loc with this:

__asm(
    "push %%rdi;"
    "push %%rbx;"
    "mov %0, %%rdi;"
    "lea %1, %%rbx;" 
    "add %%rbx, %%rdi;"
    "pop %%rbx;"
    "mov %2, %%rax;"
    "stosb;"
    "shr $8, %%rax;"
    "stosb;"
    "shr $8, %%rax;"
    "stosb;"
    "pop %%rdi;" : :
    "r"(loc), "r"(vidmem), "r"(color)
);

When I compile this, clang gives me this error for every push instruction: unknown use of instruction mnemonic without a size suffix

So when I saw that error, I assumed it had to do with my omission of the GAS suffixes (which should have been implicitly decided on, anyway). But when I added the "l" suffix (all of my variables are uint32_ts), I got the same error! I'm not quite sure what's causing it, and any help would be much appreciated. Thanks in advance!


Solution

  • You could probably make the compiler's output for your C version much more efficient by loading vidmem into a local variable before the stores. As it is, it can't assume that the stores don't alias vidmem itself, so it reloads the pointer before every byte store. (Hrm, doing that does let gcc 4.9.2 avoid reloading vidmem, but it still generates some nasty code; clang 3.5 does slightly better.)

    Implementing what I said in my comment on your answer (that stos is 3 uops vs. 1 for mov):

    #include <stdint.h>
    
    extern uint8_t *vidmem;
    void putpixel_asm_peter(uint32_t color, uint32_t loc)
    {
        // uint32_t loc  = (x*pixel_w)+(y*pitch);
        __asm(  "\n"
            "\t movb %b[col], (%[ptr])\n"      // store the low byte of color
            "\t shrl $8, %[col];\n"            // shift the next two bytes down
            "\t movw %w[col], 1(%[ptr]);\n"    // store them with one 16-bit store
            : [col] "+r" (color),  "=m" (vidmem[loc])
            : [ptr] "r" (vidmem+loc)
            :
            );
    }
    

    compiles to a very efficient implementation:

    gcc -O3 -S -o- putpixel.c 2>&1 | less  # (with extra lines removed)
    
    putpixel_asm_peter:
            movl    %esi, %esi
            addq    vidmem(%rip), %rsi
    #APP
            movb %dil, (%rsi)
            shrl $8, %edi;
            movw %di, 1(%rsi);
    #NO_APP
            ret
    

    All of those instructions decode to a single uop on Intel CPUs. (The stores can micro-fuse, because they use a single-register addressing mode.) The movl %esi, %esi zeroes the upper 32 bits, since the caller might have generated that function arg with a 64-bit instruction that left garbage in the high 32 bits of %rsi. Your version could have saved some instructions by using constraints to ask for the values in the desired registers in the first place, but this will still be faster than stos.
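
    For illustration (a sketch of my own, not part of the original answer): "asking for the right registers" with constraints would look something like this if you really wanted to keep stosb. The "a" and "D" constraints put color in eax and the destination pointer in rdi/edi, so no push/mov/pop shuffling is needed:

    #include <stdint.h>
    
    extern uint8_t *vidmem;
    
    void putpixel_stos(uint32_t color, uint32_t loc)   // hypothetical name
    {
        void *end;                      // stosb advances the destination pointer,
                                        // so expose its final value as an output
        __asm volatile(
            "stosb\n\t"                 // store AL (low byte), rdi++
            "shrl $8, %%eax\n\t"
            "stosb\n\t"                 // store AL (middle byte), rdi++
            "shrl $8, %%eax\n\t"
            "stosb"                     // store AL (high byte), rdi++
            : "=D" (end), "+a" (color)  // "D" = rdi/edi, "a" = eax
            : "0" (vidmem + loc)        // start the destination at the pixel's address
            : "memory");                // the stores aren't visible via an "m" operand
        (void) end;
    }

    Even so, each stosb is still 3 uops, so the plain mov version above should win.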

    Also notice how I let the compiler take care of adding loc to vidmem. You could have done it more efficiently in your version, with a lea to combine an add with a mov. However, if the compiler wants to get clever when this is used in a loop, it can increment the pointer instead of recomputing vidmem + loc every iteration. Finally, this means the same code will work for 32-bit and 64-bit: %[ptr] will be a 64-bit reg in 64-bit mode, but a 32-bit reg in 32-bit mode. Since I don't have to do any math on it, it Just Works.
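
    For example (my own illustration, with a hypothetical caller), if something like this gets inlined at -O2, the optimizer is free to turn the repeated vidmem + loc math into a single pointer that it bumps by 3 each iteration:

    // hypothetical caller: draw npix pixels of a horizontal line, 3 bytes apart,
    // using putpixel_asm_peter() from above (same translation unit, so it can inline)
    void hline(uint32_t color, uint32_t loc, uint32_t npix)
    {
        for (uint32_t i = 0; i < npix; i++)
            putpixel_asm_peter(color, loc + 3*i);
    }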

    I used an "=m" output constraint to tell the compiler where we're writing in memory. (I should have cast the pointer to a struct { char a[3]; } or something, to tell gcc how much memory it actually writes, as per the tip at the end of the "Clobbers" section in the gcc manual.)
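
    A sketch of what that would look like (my rendering of the manual's tip, not code from the answer; same #include and extern uint8_t *vidmem as above, and putpixel_wide_memop is a made-up name). The struct cast makes the memory operand cover all 3 bytes the asm actually writes, instead of just vidmem[loc]:

    void putpixel_wide_memop(uint32_t color, uint32_t loc)
    {
        __asm(  "\n"
            "\t movb %b[col], (%[ptr])\n"
            "\t shrl $8, %[col];\n"
            "\t movw %w[col], 1(%[ptr]);\n"
            : [col] "+r" (color),
              "=m" (*(struct { char a[3]; } *) (vidmem + loc))  // 3-byte memory output
            : [ptr] "r" (vidmem + loc)
            :
            );
    }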

    I also used color as an input/output constraint to tell the compiler that we modify it. If this got inlined, and later code expected to still find the value of color in the register, we'd have a problem. Having this in a function means color is already a tmp copy of the caller's value, so the compiler will know it needs to throw away the old color. Calling this in a loop could be slightly more efficient with two read-only inputs: one for color, one for color >> 8.
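
    That loop-friendly variant might look something like this (again a sketch of my own, not from the answer; same #include and extern uint8_t *vidmem as above, hypothetical name). color is only read, so its register survives the asm, and the compiler can hoist the color >> 8 computation out of a loop when color is loop-invariant:

    void putpixel_ro_inputs(uint32_t color, uint32_t loc)
    {
        __asm(  "\n"
            "\t movb %b[lo], (%[ptr])\n"       // low byte
            "\t movw %w[hi], 1(%[ptr]);\n"     // middle + high bytes in one store
            : "=m" (vidmem[loc])               // (same too-narrow caveat as above)
            : [lo] "r" (color), [hi] "r" (color >> 8), [ptr] "r" (vidmem + loc)
            :
            );
    }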

    Note that I could have written the constraints as

        : [col] "+r" (color), [memref] "=m" (vidmem[loc])
        :
        :
    

    But using %[memref] and 1 %[memref] to generate the desired addresses would lead gcc to emit

        movl    %esi, %esi
        movq    vidmem(%rip), %rax
    # APP
        movb %dil, (%rax,%rsi)
        shrl $8, %edi;
        movw %di, 1 (%rax,%rsi);
    

    The two-reg addressing mode means the store instructions can't micro-fuse (on Sandybridge and later, at least).

    You don't even need inline asm to get decent code, though:

    void putpixel_cast(uint32_t color, uint32_t loc)
    {
        // uint32_t loc  = (x*pixel_w)+(y*pitch);
        typeof(vidmem) vmem = vidmem;
        vmem[loc]   = color & 255;
    #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        *(uint16_t *)(vmem+loc+1) = color >> 8;  // one 2-byte store covers the middle and high bytes
    #else
        vmem[loc+1] = (color >> 8) & 255; // gcc sucks at optimizing this for little endian :(
        vmem[loc+2] = (color >> 16) & 255;
    #endif
    }
    

    compiles to (gcc 4.9.2 and clang 3.5 give the same output):

        movq    vidmem(%rip), %rax
        movl    %esi, %esi
        movb    %dil, (%rax,%rsi)
        shrl    $8, %edi
        movw    %di, 1(%rax,%rsi)
        ret
    

    This is only a tiny bit less efficient than what we get with inline asm, and it should be easier for the optimizer to work with when inlined into loops.

    Overall performance

    Calling this in a loop is probably a mistake. It'll be more efficient to combine multiple pixels in a register (esp. a vector register), and then write them all at once. Or do 4-byte writes, each one overlapping the last byte of the previous write, until you get to the end and have to preserve the byte after the last 3-byte chunk.
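
    A rough sketch of that overlapping-write idea (my own, not from the answer; it assumes little-endian byte order and 3-byte pixels, and the put_scanline / colors names are made up):

    #include <stdint.h>
    #include <string.h>
    
    extern uint8_t *vidmem;
    
    static void put_scanline(uint32_t loc, const uint32_t *colors, size_t n)
    {
        uint8_t *p = vidmem + loc;
        for (size_t i = 0; i + 1 < n; i++) {   // every pixel except the last
            memcpy(p, &colors[i], 4);          // one 4-byte store; the 4th byte is
            p += 3;                            // garbage, but the next store fixes it
        }
        if (n) {                               // last pixel: byte stores, so the byte
            uint32_t c = colors[n-1];          // after the buffer is left untouched
            p[0] = c & 255;
            p[1] = (c >> 8) & 255;
            p[2] = (c >> 16) & 255;
        }
    }

    With optimization on, compilers lower the fixed-size memcpy to a single 4-byte store, so most pixels cost one store instead of three.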

    See http://agner.org/optimize/ for more stuff about optimizing C and asm. That and other links can be found at https://stackoverflow.com/tags/x86/info.