cuda language-lawyer inline-assembly redundancy ptx

In asm volatile inline PTX instructions, why also specify "memory" side effects?

Consider the following excerpt from CUDA's Inline PTX Assembly guide (v10.2):

The compiler assumes that an asm() statement has no side effects except to change the output operands. To ensure that the asm is not deleted or moved during generation of PTX, you should use the volatile keyword, e.g.:
asm volatile ("mov.u32 %0, %%clock;" : "=r"(x));
Normally any memory that is written to will be specified as an out operand, but if there is a hidden side effect on user memory (for example, indirect access of a memory location via an operand), or if you want to stop any memory optimizations around the asm() statement performed during generation of PTX, you can add a "memory" clobbers specification after a 3rd colon...

It sounds like both volatile and :: "memory" are intended to indicate side effects in memory. Now, granted, there could be non-memory side effects (like for trap;). But when I've already used volatile, isn't it useless/meaningless to also specify :: "memory"?

Solution

A non-volatile inline asm statement is treated as a pure function of its inputs: it is assumed to give the same outputs every time it runs with the same explicit inputs.

And separately, an asm statement without a "memory" clobber is assumed not to read or write anything that hasn't been mentioned as an input or output operand.

It sounds like both volatile and :: "memory" are intended to indicate side effects in memory.

No, volatile just means that the output operands are not a pure function of the input operands. A "memory" clobber is mostly orthogonal and is not implied by volatile.

The example you quoted appears to be reading a %%clock cycle counter or something similar, which needs to re-execute every time; otherwise the compiler could CSE it (common subexpression elimination) and hoist it out of a loop. You wouldn't want that to force the compiler to spill/reload any global vars it had in registers. volatile doesn't imply memory side-effects, so it's just the ticket for this use-case.
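To make that concrete, here is a minimal GNU C sketch (GCC/Clang), not CUDA. The names sample, accumulate and total are hypothetical, and the asm template is deliberately empty so it assembles on any target; in the quoted CUDA example the template would be the %%clock read instead. It uses the __asm__ spelling, which GCC and Clang also accept in strict -std=c99 style modes.

```c
/* Hypothetical stand-in for a counter read: the template is empty, so the
   value in x simply passes through unchanged. */
static unsigned sample(unsigned seed)
{
    unsigned x = seed;
    /* volatile: the statement must not be deleted, CSE'd, or hoisted.
       "+r"(x): x is read and written in a register.
       No "memory" clobber: the compiler is not told that any memory is
       touched, so it has no reason to spill/reload other variables here. */
    __asm__ volatile("" : "+r"(x));
    return x;
}

unsigned total;  /* global that the compiler may keep in a register in the loop */

static unsigned accumulate(unsigned n)
{
    total = 0;
    for (unsigned i = 0; i < n; ++i)
        total += sample(i);  /* one asm execution per iteration */
    return total;
}
```

Because the template is empty, accumulate(4) just sums 0 + 1 + 2 + 3. The point is only what the compiler is and isn't allowed to assume around the statement.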

It would still be a bug for the asm template to read or write any other variables behind the compiler's back (not via explicit "m", "=m", or "+m" operands) because volatile doesn't imply a "memory" clobber.

In GNU C inline asm, even an "r"(pointer_variable) does not imply that the pointed-to data is read or written. e.g. an assignment can be optimized away as a dead store if all you do with the variable is pass a pointer to it as an input to an asm statement without a "memory" clobber. See: How can I indicate that the memory *pointed* to by an inline ASM argument may be used?

A "memory" clobber will get the compiler to assume that any globally-reachable memory (or reachable via pointer inputs) may have been read or written, and thus spill/reload vars from registers around such an asm statement. (Unless escape analysis can prove that nothing else could have a pointer to them, i.e. that a pointer to the var hasn't "escaped" the local scope. Just like how compilers decide they can keep a var in a register across a non-inline function call.)
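As a sketch of the two ways to tell the compiler about the pointed-to memory in GNU C (GCC/Clang): a "memory" clobber, or an explicit memory operand. The function names are hypothetical and the templates are empty so this builds on any target. This is GNU C; the CUDA guide quoted in the question only mentions the "memory" clobber, so don't assume "m" operands carry over.

```c
/* Option 1: "memory" clobber. "r"(p) only says the pointer value itself is
   read; the clobber says any reachable memory, including buf, may be read
   or written, so the store to buf can't be treated as dead. */
static int pass_with_clobber(int value)
{
    int buf = value;
    int *p = &buf;
    __asm__ volatile("" : : "r"(p) : "memory");
    return buf;
}

/* Option 2: explicit memory operand. "+m"(*p) names exactly the pointed-to
   object as read and written, without affecting unrelated variables. */
static int pass_with_mem_operand(int value)
{
    int buf = value;
    int *p = &buf;
    __asm__ volatile("" : "+m"(*p) : "r"(p));
    return buf;
}
```

The explicit operand is the narrower tool: it only constrains accesses to that one object, whereas the clobber applies to everything the asm statement could reach.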

So is "memory" alone safe without volatile? No.

A "memory" clobber does not stop an asm statement from optimizing away if none of its explicit output operands are used. (With no "=..." operands, an asm statement is implicitly volatile).

A non-volatile asm statement with a memory clobber has to be assumed to modify any reachable memory at that point in the abstract machine if/when the asm template string executes, but the compiler is still free to make transformations that result in that not happening at all, or less often than the source would. (e.g. hoist it out of a loop if the other vars that change in the loop are all locals whose address hasn't escaped the function.)

A non-volatile asm statement is still assumed to be a pure function wrt. its explicit inputs and outputs, so asm("..." : "=r"(out) : "r"(in) : "memory"); could be hoisted out of a loop if the loop used the same "in" every iteration. (This could only happen if the loop variables were all locals which the asm statement couldn't have a pointer to (escape analysis like for a non-inline function call). Otherwise the "memory" clobber would block that reordering.)

Or it could be optimized away entirely if all uses of "out" can be optimized away, regardless of any memory accesses around the statement. The decision is only based on the explicit operands if you omit volatile.
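A small GNU C sketch (GCC/Clang) of those two cases, with hypothetical function names and an empty template so it builds on any target. The "0"(in) matching constraint puts the input in the same register as output 0, so with an empty template out simply equals in.

```c
/* Same explicit input on every iteration and no volatile: the compiler may
   run the asm once and reuse the result, despite the "memory" clobber,
   because nothing else in the loop touches memory the asm could reach. */
static unsigned twice_same_input(unsigned in)
{
    unsigned sum = 0;
    for (int i = 0; i < 2; ++i) {
        unsigned out;
        __asm__("" : "=r"(out) : "0"(in) : "memory");
        sum += out;
    }
    return sum;
}

/* The only output is never used and there is no volatile: the compiler may
   delete the whole statement, "memory" clobber included. */
static void result_unused(unsigned in)
{
    unsigned out;
    __asm__("" : "=r"(out) : "0"(in) : "memory");
    (void)out;
}
```

Whether a given compiler actually hoists or deletes these depends on the optimization level; the sketch only shows what it is permitted to do. Adding volatile to either statement removes that permission.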

There aren't many use-cases for a "memory" clobber without volatile; you could imagine using it to describe a function that internally uses a cache to memoize results. The compiler can run it as often or as infrequently as it wants, and we don't actually care whether the internal cache got mutated or not. It's a side effect, but not a valuable one.

(I'm assuming that CUDA inline asm has identical semantics to GNU C inline asm as supported/implemented by Clang/LLVM and by GCC. From the quote that certainly appears to be the case. I don't really know anything about CUDA so everything I said above is based on GNU C inline asm, because CUDA asm appears to be identical. Correct me if I'm wrong, e.g. if asm statements with no output operands are not implicitly volatile or if CUDA doesn't have pointers.

Since GNU C inline asm syntax was designed for C and later repurposed for CUDA instead, it may help your understanding of the design to think in terms of C including pointers and escape analysis.)