Inline assembly with chunks of a char array as memory output operands

I am executing cpuid(leaf 0) that gives me the vendor string. The code (under block1) works fine and displays GenuineIntel just as I expect. In asm block2 below I want to directly map the ebx, edx, ecx values to the vendor array instead of using explicit mov instructions.

Currently I am trying to move the resulting ebx value (four bytes) into the first four bytes of the vendor array. This displays a value of G on the screen which is the first byte of ebx.

I tried casting to uint32_t* and that gives a build error lvalue required in asm statement.

I want to understand what changes should be made to the code for it to write the first four bytes to the vendor array? Is there a way to do this without using the explicit mov instructions? Any help is appreciated. Thank you.

#include <iostream>
#include <cstdint>
using namespace std;

const int VENDORSIZE = 12;
int main(int argc, char **argv)
{
    char vendor[VENDORSIZE +1]{};
    uint32_t leaf = 0;
    vendor[VENDORSIZE] = '\0';
    // Block 1
    /*asm volatile(
        "cpuid\n"
        "mov %%ebx, %0\n"
        "mov %%edx, %1\n"
        "mov %%ecx, %2\n"
        :"=m"(vendor[0]),"=m"(vendor[4]),"=m"(vendor[8])
        :"a"(leaf)
        :
    );*/

    // Block 2
    asm volatile(
    "cpuid\n"
    :"=b"(*vendor)
    :"a"(leaf)
    :
   );
    
    cout << vendor<< endl;
    return 0;
}

My try with cast:

// Block 2
    asm volatile(
    "cpuid\n"
    :"=b"((uint32_t*) vendor)
    :"a"(leaf)
    :
   );

This generates an error:

cpuid.cpp:28:5: error: invalid lvalue in asm output 0

Based on Peter Cordes's link below, I added the missing dereference. The code below now outputs GenuineIntel. I sincerely appreciate the help.

// Block 2
    asm volatile(
    "cpuid\n"
    :"=b"(*(uint32_t*)vendor),"=d"(*(uint32_t*)(vendor+4)),"=c"(*(uint32_t*)(vendor+8))
    :"a"(leaf)
    :
   );

Solution

First of all, for actually using cpuid, prefer using intrinsic wrappers like __get_cpuid from GCC's cpuid.h, or GNU C builtin functions.

How do I call "cpuid" in Linux? for __get_cpuid
Intrinsics for CPUID like informations? for stuff like __builtin_cpu_supports("avx")
https://wiki.osdev.org/CPUID
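As a rough sketch of the intrinsic route, assuming GCC or clang on x86 (where cpuid.h provides __get_cpuid), the vendor string can be fetched with no inline asm at all. The function name cpu_vendor is made up for this example.

```cpp
#include <cpuid.h>   // GCC/clang header, x86 only
#include <string>

// Hypothetical helper: returns the 12-byte vendor string,
// or an empty string if __get_cpuid reports leaf 0 as unsupported.
std::string cpu_vendor()
{
    unsigned int eax = 0;
    unsigned int words[3] = {};   // filled in string order: EBX, EDX, ECX

    // Argument order of __get_cpuid is (leaf, &eax, &ebx, &ecx, &edx).
    if (!__get_cpuid(0, &eax, &words[0], &words[2], &words[1]))
        return {};

    // Reading an unsigned int array through a char pointer is allowed.
    return std::string(reinterpret_cast<const char*>(words), 12);
}
```

Calling std::cout << cpu_vendor() << '\n'; should print the same string as the asm versions discussed here.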

The rest of this answer is just using CPUID as an example to talk about chunks of chars and arrays as operands to GNU C inline asm, and other points of correctness.

*vendor has type char, so you've asked the compiler to take BL as the value of vendor[0] (aka *vendor) after your asm instructions run. That's why it only stores the G, the low byte of EBX.

You can see this if you look at the compiler-generated asm https://godbolt.org/z/5bva6zvvK and note the movb %bl, 2(%rsp)
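To make the operand-width point concrete, here is a small sketch (x86 with GCC/clang assumed, function names invented for illustration) where the only difference between the two functions is the C++ type of the "=b" output:

```cpp
#include <cstdint>

// char output: the compiler reads only BL after the asm statement.
char ebx_as_char()
{
    std::uint32_t leaf = 0;
    char low;
    asm("cpuid"
        : "+a"(leaf), "=b"(low)
        :
        : "ecx", "edx");   // CPUID also writes ECX and EDX
    return low;
}

// uint32_t output: the compiler reads all of EBX.
std::uint32_t ebx_as_u32()
{
    std::uint32_t leaf = 0;
    std::uint32_t whole;
    asm("cpuid"
        : "+a"(leaf), "=b"(whole)
        :
        : "ecx", "edx");
    return whole;
}
```

The char version should return just the low byte of what the uint32_t version returns, which is the 'G' from the question.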

Other bugs in your asm:

You don't tell the compiler that EAX is modified by the asm statement, instead telling the compiler "a"(0) is a pure input.
Your Block 1 (with mov stores) fails to tell the compiler about EBX, ECX, and EDX being clobbered, too. Using "=b", "=c", and "=d" outputs would fix that.
The version with "=m"(vendor[0]), vendor[4] etc is only telling the compiler that bytes 0, 4, and 8 of the array were modified, not bytes 1..3 or 5..7. So the asm stores to memory you haven't told the compiler is an output. It's unlikely to be a problem in practice, but see How can I indicate that the memory *pointed* to by an inline ASM argument may be used? for ways to declare a whole array as an output.

Also, volatile is overkill / unnecessary here. CPUID leaf 0 (and I think other leaves) will always give you the same result every time, and the whole asm statement has no side effects beyond writing its output operands, so it's a pure function of its input operand. That's what non-volatile asm implies. (Assuming you don't need it to do double duty as a serializing instruction or memory barrier for some reason.) Unlikely to matter since you hopefully wouldn't write code that ran this statement in a loop anyway; CPUID is slow so you'd want to cache the results, not rely on common-subexpression-elimination. I guess it could be useful to let this optimize away if you didn't actually print the result at all.
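One way to do that caching, sketched under the assumption of x86 with GCC/clang, is a function-local static that runs a non-volatile asm statement once. VendorWords, query_vendor, and cached_vendor are names invented for this example.

```cpp
#include <cstdint>

// Four dwords: three from CPUID plus one zero dword as the terminator.
struct VendorWords { std::uint32_t w[4]; };

static VendorWords query_vendor()
{
    VendorWords v{};            // w[3] stays 0
    std::uint32_t leaf = 0;
    asm("cpuid"                 // not volatile: a pure function of its input
        : "+a"(leaf), "=b"(v.w[0]), "=d"(v.w[1]), "=c"(v.w[2]));
    return v;
}

// Runs CPUID on the first call only; later calls return the cached bytes.
const char* cached_vendor()
{
    static const VendorWords v = query_vendor();   // thread-safe init since C++11
    return reinterpret_cast<const char*>(v.w);
}
```

Every call after the first only pays for the guard check on the static, not for another CPUID.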

e.g. safe code using mov inside the asm template would look like this:

const int VENDORSIZE = 12;
int main1()
{
    char vendor[VENDORSIZE+2];
    int leaf = 0;
    asm (   // doesn't need to be volatile; we'll get the same result for eax=0 every time
        "cpuid\n"
        "mov %%ebx, %0\n"
        "mov %%edx, 4 + %0\n"
        "mov %%ecx, 8 + %0\n"
        : "=m"(vendor)    // the whole local array is an output.
                       //  Only works for true arrays; pointers need casting to array
          ,"+a"(leaf)  // EAX is modified, too
        :  // no pure inputs
        : "ebx", "ecx", "edx"  // Tell compiler about registers we destroyed.
    );
    vendor[VENDORSIZE+0] = '\n';
    vendor[VENDORSIZE+1] = '\0';
    std::cout << vendor;     // std::endl is pointless here
                             // so just make room for \n in the array
                             // instead of a separate << '\n'  function call.
    return 0;
}

I used the whole array (vendor) as a memory output operand, instead of *vendor, vendor[4], etc. The optimized asm will be the same, but with optimization disabled the 3-output way might have generated 3 separate pointers. More importantly, it solves the problem of telling the compiler about each and every byte that gets written.

It's also telling the compiler that the whole array is written, not just the first 12 bytes, so if I had assigned the '\n' and '\0' before the asm statement, the compiler could legally remove them as dead stores. (It doesn't, but I think it could with "=m"(vendor) instead of "+m".)

AT&T syntax has the nice property that memory addressing modes are offsettable, so 4 + %0 expands to something like 4 + 2(%rsp) which is just 6(%rsp). If the compiler happens to pick an addressing mode without a number like (%rsp), GAS does accept 4 + (%rsp) as equivalent to 4(%rsp), although with a warning like Warning: missing operand; zero assumed.

If this was in a function that took a char* arg so you only had a pointer, not an actual C array, you'd have to cast to pointer-to-array and dereference. This looks like it would violate strict-aliasing, but it's actually what the GCC manual recommends. See How can I indicate that the memory *pointed* to by an inline ASM argument may be used?

    ...  // if vendor is just a char* function arg

    : "=m"( *(char (*)[VENDORSIZE]) vendor )   
      // tells the compiler that we write 12 bytes
      // With empty [], would tell the compiler we might write an arbitrary size starting at that pointer.
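A possible complete function for the char* case is sketched below (x86 with GCC/clang assumed). It is a variant rather than the exact form above: the pointer is also passed in a register so the template can use plain 4(%reg) addressing instead of offsetting %0, and the cast-to-array operand is only a dummy that tells the compiler which bytes get written. store_vendor and vendor_via_pointer are hypothetical names.

```cpp
#include <string>

const int VENDORSIZE = 12;

// Writes exactly VENDORSIZE bytes to out (no terminator).
void store_vendor(char *out)
{
    int leaf = 0;
    asm("cpuid\n\t"
        "mov %%ebx, 0(%[p])\n\t"
        "mov %%edx, 4(%[p])\n\t"
        "mov %%ecx, 8(%[p])"
        : "=m"(*(char (*)[VENDORSIZE]) out),  // dummy operand: these 12 bytes are written
          "+a"(leaf)                          // EAX is read and written
        : [p] "r"(out)                        // the address itself, in some free register
        : "ebx", "ecx", "edx");               // registers destroyed by CPUID
}

// Small wrapper so the result is easy to print or compare.
std::string vendor_via_pointer()
{
    char buf[VENDORSIZE];
    store_vendor(buf);
    return std::string(buf, VENDORSIZE);
}
```

Because EBX, ECX, and EDX are clobbers and EAX is an in/out operand, the compiler has to pick some other register for %[p], so the pointer survives the CPUID instruction.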

Using register output operands

"=b"( *(uint32_t*)&vendor[0] ) would work, but violates the strict-aliasing rule with that pointer cast, accessing char objects through a uint32_t *. It happens to work in current GCC/clang, but wouldn't be truly safe / supported unless you compiled with -fno-strict-aliasing.

Example on Godbolt (also including the mov version and the below uint32_t[] version) showing that it compiles and runs correctly (with GCC, clang, and ICC.)

    // works but violates strict-aliasing
    char vendor[VENDORSIZE + 2];

    asm( "cpuid"
    : "+a"(leaf),       // read/write operand
      "=b"( *(uint32_t*)&vendor[0] ),   // strict-aliasing violation in the pointer cast
      "=d"( *(uint32_t*)&vendor[4] ),
      "=c"( *(uint32_t*)&vendor[8] )
     // no pure inputs, no clobbers
   );

You can legally point a char* at anything, but it's not strictly safe to point other things at char objects. If vendor was a pointer to memory you got from malloc or something, there would be no underlying type for the memory, just access via uint32_t* and later reading via char * so it would be safe. But for an actual array, I think it's not, even though array accesses work in terms of pointer deref.

You can declare the array as uint32_t, and then use char * access to those bytes:

Fully safe version

int main3()  // fully safe without strict-aliasing violations.
{
    uint32_t vendor[VENDORSIZE/sizeof(uint32_t) + 1];  // wastes 2 bytes
    int leaf = 0;
    asm( "cpuid"
     : "+a"(leaf),      // read/write operand, compiler needs to know that CPUID writes EAX
       "=b"( vendor[0] ),  // ask the compiler to assign to the array
       "=d"( vendor[1] ),
       "=c"( vendor[2] )
      // no pure inputs, no clobbers
    );
    
    vendor[3] = '\n';  // x86 is little-endian so the \0 terminator is part of this.
    std::cout << reinterpret_cast<const char*>(vendor);
    return 0;
}

Is this "better"? It fully avoids any undefined behaviour, at the cost of wasting 2 bytes (16 byte array vs. 14). Otherwise compiles identically (except for a dword store with the newline which is probably actually better, given how GCC uses two instructions to make sure to avoid an LCP stall on pre-Sandybridge CPUs). Taking a char* that points to a uint32_t[] is legal, and so is dereferencing it, so it's fully safe to pass it to a function like cout::operator<<.

It also seems fairly human-readable: you're basically getting chunks of uint32_t from CPUID, and reinterpreting those bytes as a character array, so the semantic meaning of the code as written does fairly nicely show what's going on. Tacking on the '\n' is slightly non-obvious, but ((char*)vendor)[12] = '\n'; / ... [13] = 0; could make it clearer.

I don't know how likely / unlikely it is for the C++ UB (strict-aliasing violation on a char[] array) in the pointer-cast version to ever cause a problem on any future compiler. I'm pretty confident it's fine on current GCC/clang/ICC even after inlining into complex surrounding code that reuses the array for other things before / after.
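If you want to keep a char array and still avoid the pointer cast, a different technique from the ones above is to take the registers as plain uint32_t locals and memcpy them into place. This is only a sketch assuming x86 with GCC/clang; vendor_via_memcpy is a made-up name. Compilers normally turn fixed-size 4-byte memcpy calls like these into single stores.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

std::string vendor_via_memcpy()
{
    std::uint32_t leaf = 0;
    std::uint32_t b, c, d;
    asm("cpuid"
        : "+a"(leaf), "=b"(b), "=c"(c), "=d"(d));  // no pure inputs, no clobbers

    char vendor[12];
    std::memcpy(vendor + 0, &b, sizeof b);   // EBX holds bytes 0..3
    std::memcpy(vendor + 4, &d, sizeof d);   // EDX holds bytes 4..7
    std::memcpy(vendor + 8, &c, sizeof c);   // ECX holds bytes 8..11
    return std::string(vendor, sizeof vendor);
}
```

memcpy copies object representations, so there is no access to char objects through a uint32_t lvalue and nothing for strict-aliasing to object to.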

If you were writing portable inline asm for a bi-endian architecture (or simply on a big-endian machine), you might memcpy(vendor+3, "\n", 2), or cast to char* for the assignments to make sure you store the chars at the right byte offsets. Of course the whole idea of storing registers to a char array would depend on the 4 chars per register being in an order that matches the current endianness.
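The memcpy form of the terminator can be tried without any asm. In this sketch the three register values are passed in as arguments, and word_from and with_newline are helper names invented for the example:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Builds a dword from 4 chars in memory order, whatever the host endianness.
std::uint32_t word_from(const char *four)
{
    std::uint32_t w;
    std::memcpy(&w, four, sizeof w);
    return w;
}

// b, d, c stand in for the EBX, EDX, ECX values from CPUID leaf 0.
std::string with_newline(std::uint32_t b, std::uint32_t d, std::uint32_t c)
{
    std::uint32_t vendor[4] = {b, d, c, 0};
    // Copies '\n' and the literal's own '\0' to byte offsets 12 and 13,
    // independent of byte order (unlike vendor[3] = '\n').
    std::memcpy(vendor + 3, "\n", 2);
    return reinterpret_cast<const char*>(vendor);
}
```

For example, with_newline(word_from("Genu"), word_from("ineI"), word_from("ntel")) yields "GenuineIntel\n" on either endianness, because both the input words and the terminator are placed by byte offset.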

Misc other parts of the question

I tried casting to uint32_t* and that gives a build error lvalue required in asm statement.

Presumably you put your cast somewhere else or left out some dereferencing since the compiler complained about an rvalue instead of an lvalue. The C++ expression you put inside the parens has to be the C++ object you want to assign to, even for a "=m" memory operand. That's why you used vendor[4] not vendor+4 in the first version.

directly map the ebx, edx, ecx values to the vendor array

Keep in mind that if the compiler needs them in memory (e.g. when it passes vendor to cout::operator<<(char*)), it's going to have to emit mov store instructions after your asm template. The mapping between C++ variables and operand locations is just like an = assignment, and in this case you're not saving asm instructions.

You would be saving instructions if you were doing vendor[0] == 'G' or something, or a memcmp that could inline; the compiler could just check bl or ebx instead of storing and then using a memory operand.
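A sketch of that kind of use, assuming x86 with GCC/clang: the comparison below works on the register outputs directly, so the compiler is free to compare EBX, EDX, and ECX against constants without ever storing the vendor bytes. vendor_is is a hypothetical name.

```cpp
#include <cstdint>
#include <cstring>

// name must be a 12-character string literal such as "GenuineIntel".
bool vendor_is(const char (&name)[13])
{
    std::uint32_t want[3];
    std::memcpy(want, name, sizeof want);   // first 12 bytes as three dwords

    std::uint32_t leaf = 0;
    std::uint32_t b, c, d;
    asm("cpuid"
        : "+a"(leaf), "=b"(b), "=c"(c), "=d"(d));

    // String order is EBX, EDX, ECX.
    return b == want[0] && d == want[1] && c == want[2];
}
```

With a literal argument and inlining, want[] can fold to immediates, so a call like vendor_is("GenuineIntel") may compile down to CPUID plus three compares; checking the generated asm is the way to confirm that for a given compiler.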

But in general yes it's a good idea to let the compiler handle data movement, keeping your asm template minimal and just telling the compiler where the inputs and outputs are. I just wanted to be clear about what "directly map" does and doesn't mean. It's often a good idea to look at the compiler-generated asm around your asm template string (and to check what it picked).