How to have GCC combine "move r10, r3; store r10" into a "store r3"?

I'm working Power9 and utilizing the hardware random number generator instruction called DARN. I have the following inline assembly:

uint64_t val;
__asm__ __volatile__ (
    "xor 3,3,3                     \n"  // r3 = 0
    "addi 4,3,-1                   \n"  // r4 = -1, failure
    "1:                            \n"
    ".byte 0xe6, 0x05, 0x61, 0x7c  \n"  // r3 = darn 3, 1
    "cmpd 3,4                      \n"  // r3 == -1?
    "beq 1b                        \n"  // retry on failure
    "mr %0,3                       \n"  // val = r3
    : "=g" (val) : : "r3", "r4", "cc"
);

I had to add a mr %0,3 with "=g" (val) because I could not get GCC to produce expected code with "=r3" (val). Also see Error: matching constraint not valid in output operand.

A disassembly shows:

(gdb) b darn.cpp : 36
(gdb) r v
...

Breakpoint 1, DARN::GenerateBlock (this=<optimized out>,
    output=0x7fffffffd990 "\b", size=0x100) at darn.cpp:77
77              DARN64(output+i*8);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-222.el7.ppc64le libgcc-4.8.5-28.el7_5.1.ppc64le libstdc++-4.8.5-28.el7_5.1.ppc64le
(gdb) disass
Dump of assembler code for function DARN::GenerateBlock(unsigned char*, unsigned long):
   ...
   0x00000000102442b0 <+48>:    addi    r10,r8,-8
   0x00000000102442b4 <+52>:    rldicl  r10,r10,61,3
   0x00000000102442b8 <+56>:    addi    r10,r10,1
   0x00000000102442bc <+60>:    mtctr   r10
=> 0x00000000102442c0 <+64>:    xor     r3,r3,r3
   0x00000000102442c4 <+68>:    addi    r4,r3,-1
   0x00000000102442c8 <+72>:    darn    r3,1
   0x00000000102442cc <+76>:    cmpd    r3,r4
   0x00000000102442d0 <+80>:    beq     0x102442c8 <DARN::GenerateBlock(unsigned char*, unsigned long)+72>
   0x00000000102442d4 <+84>:    mr      r10,r3
   0x00000000102442d8 <+88>:    stdu    r10,8(r9)

Notice GCC faithfully reproduces the:

0x00000000102442d4 <+84>:    mr      r10,r3
0x00000000102442d8 <+88>:    stdu    r10,8(r9)

How do I get GCC to fold the two instructions into:

0x00000000102442d8 <+84>:    stdu    r3,8(r9)

Solution

GCC will never remove text that's part of the asm template; it doesn't even parse it other than substituting in for %operand. It's literally just a text substitution before the asm is sent to the assembler.

You have to leave out the mr from your inline asm template, and tell gcc that your output is in r3 (or use a memory-destination output operand, but don't do that). If your inline-asm template ever starts or ends with mov instructions, you're usually doing it wrong.

Use register uint64_t foo asm("r3"); to force "=r"(foo) to pick r3 on platforms that don't have specific-register constraints.

(Despite ISO C++17 removing the register keyword, this GNU extension still works with -std=c++17. You can also use register uint64_t foo __asm__("r3"); if you want to avoid the asm keyword. You probably still need to treat register as a reserved word in source that uses this extension; that's fine. ISO C++ removing it from the base language doesn't force implementations to not use it as part of an extension.)

Or better, don't hard-code a register number. Use an assembler that supports the DARN instruction. (But apparently it's so new that even up-to-date clang lacks it, and you'd only want this inline asm as a fallback for gcc too old to support the __builtin_darn() intrinsic)

Using these constraints will let you remove the register setup, too, and use foo=0 / bar=-1 before the inline asm statement, and use "+r"(foo).

But note that darn's output register is write-only. There's no need to zero r3 first. I found a copy of IBM's POWER ISA instruction set manual that is new enough to include darn here: https://wiki.raptorcs.com/w/images/c/cb/PowerISA_public.v3.0B.pdf#page=96

In fact, you don't need to loop inside the asm at all, you can leave that to the C and only wrap the one asm instruction, like inline-asm is designed for.

uint64_t random_asm() {
  register uint64_t val asm("r3");
  do {
    //__asm__ __volatile__ ("darn 3, 1");
      __asm__ __volatile__ (".byte 0x7c, 0x61, 0x05, 0xe6  # gcc asm operand = %0\n" : "=r" (val));
  } while(val == -1ULL);
  return val;
}

compiles cleanly (on the Godbolt compiler explorer) to

random_asm():
.L6:                 # compiler-generated label, no risk of name clashes
    .byte 0x7c, 0x61, 0x05, 0xe6  # gcc asm operand = 3

    cmpdi 7,3,-1     # compare-immediate
    beq 7,.L6
    blr

Just as tight as your loop, with less setup. (Are you sure you even need to zero r3 before the asm instruction?)

This function can inline anywhere you want it to, allowing gcc to emit a store instruction that reads r3 directly.

In practice, you'll want to use a retry counter, as advised in the manual: if the hardware RNG is broken, it might give you failure forever so you should have a fallback to a PRNG. (Same for x86's rdrand)

Deliver A Random Number (darn) - Programming Note

When the error value is obtained, software is expected to repeat the operation. If a non-error value has not been obtained after several attempts, a software random number generation method should be used. The recommended number of attempts may be implementation specific. In the absence of other guidance, ten attempts should be adequate.

xor-zeroing is not efficient on most fixed-instruction-width ISAs, because a mov-immediate is just as short so there's no need to detect and special-case an xor. (And thus CPU designs don't spend transistors on it). Moreover, dependency rules for the PPC asm equivalent of C++11 std::memory_order_consume require it to carry a dependency on the input register, so it couldn't be dependency-breaking even if the designers wanted it to. xor-zeroing is only a thing on x86 and maybe a few other variable-width ISAs.

Use li r3, 0 like gcc does for int foo(){return 0;} https://godbolt.org/z/-gHI4C.