Here's a snippet of some SASS code for a kernel I'm working on (for an sm52 target, compiled in debugging mode):
/*0028*/ ISETP.GE.U32.AND P0, PT, R1, R0, PT; /* 0x5b6c038000070107 */
/*0030*/ @P0 BRA 0x40; /* 0xe24000000080000f */
/*0038*/ BPT.TRAP 0x1; /* 0xe3a00000001000c0 */
/* 0x007fbc0321e01fef */
/*0048*/ IADD R2, R1, RZ; /* 0x5c1000000ff70102 */
/*0050*/ I2I.U32.U32 R2, R2; /* 0x5ce0000000270a02 */
/*0058*/ MOV R2, R2; /* 0x5c98078000270002 */
/* 0x007fbc03fde01fef */
/*0068*/ MOV R3, RZ; /* 0x5c9807800ff70003 */
/*0070*/ MOV R2, R2; /* 0x5c98078000270002 */
/*0078*/ MOV R3, R3; /* 0x5c98078000370003 */
/* 0x007fbc03fde01fef */
/*0088*/ MOV R4, R2; /* 0x5c98078000270004 */
/*0090*/ MOV R5, R3; /* 0x5c98078000370005 */
/*0098*/ MOV R2, c[0x0][0x4]; /* 0x4c98078000170002 */
/* 0x007fbc03fde01fef */
/*00a8*/ MOV R3, RZ; /* 0x5c9807800ff70003 */
/*00b0*/ LOP.OR R2, R4, R2; /* 0x5c47020000270402 */
/*00b8*/ LOP.OR R3, R5, R3; /* 0x5c47020000370503 */
I'm noticing more than a couple of instructions of the form "Move the contents of register Rn to register Rn", and that doesn't seem to make sense. I know that when compiling with optimizations and without debugging info enabled, I don't get these instructions. But even in debugging mode, why are they there? What's their purpose? AFAIK, when compiling CPU code for debugging you don't get this kind of instruction.
The simple answer is that you get strange code because you've turned on debugging, which turns off optimization. This is normal with modern optimizing compilers because of how they work. They break operations down into a primitive static single-assignment (SSA) form, which makes the code easier to optimize but, when optimization is disabled, produces worse code than a simpler non-optimizing compiler would.
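To make that concrete, here is a toy sketch in Python of how an SSA-style copy can turn into a self-move. It is an illustration only and not how nvcc or ptxas is actually implemented: the virtual register names (v1, v2, v3), the `allocation` table, and the helper functions are all hypothetical. The idea is that each SSA value gets a fresh name, a later pass maps several of those names onto the same physical register, and without a cleanup pass the copy between them survives as `MOV R2, R2`.

```python
# Toy illustration only: NOT the real nvcc/ptxas pipeline.
# Each instruction is (opcode, destination, [sources]).
# v1, v2, v3 are hypothetical SSA virtual registers.
ssa_code = [
    ("IADD", "v1", ["R1", "RZ"]),
    ("I2I.U32.U32", "v2", ["v1"]),
    ("MOV", "v3", ["v2"]),  # SSA copy: a new name for the same value
]

# Hypothetical register allocation: the live ranges do not overlap,
# so all three virtual registers end up in physical register R2.
allocation = {"v1": "R2", "v2": "R2", "v3": "R2"}


def lower(code, alloc):
    """Replace virtual registers with their assigned physical registers."""
    return [
        (op, alloc.get(dst, dst), [alloc.get(s, s) for s in srcs])
        for op, dst, srcs in code
    ]


def drop_self_moves(code):
    """Peephole cleanup an optimizing build would do: remove MOV Rn, Rn."""
    return [ins for ins in code if not (ins[0] == "MOV" and ins[2] == [ins[1]])]


def fmt(ins):
    op, dst, srcs = ins
    return f"{op} {dst}, {', '.join(srcs)};"


unoptimized = [fmt(i) for i in lower(ssa_code, allocation)]
optimized = [fmt(i) for i in drop_self_moves(lower(ssa_code, allocation))]

print("\n".join(unoptimized))
print("---")
print("\n".join(optimized))
```

With the cleanup pass skipped, the output has the same shape as the instructions at 0048 to 0058 in your listing, self-move included.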
There's also a possibility, though I don't think it's the case here, that the instructions are deliberately inserted NOPs meant to delay execution. GPUs have instruction sets that are very different from those of the general-purpose CPUs you may be familiar with. For example, most CPUs behave as if instructions are executed one at a time and strictly in the order they're given. This is true despite the fact that modern CPUs will try to execute instructions in parallel, and even out of order, for improved performance. GPUs typically don't work this way. If you try to use the result that a previous instruction stores in some register before that instruction has finished, you'll get the old value of the register. Unlike a CPU, a GPU won't automatically wait for an instruction to finish before executing the next instruction that depends on it.
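A rough way to picture that hazard, as a toy model in Python: the latency value, the register contents, and the `run` function are made up for illustration and do not correspond to real Maxwell timings. The first instruction writes R0, but the write only lands a few cycles later; a dependent read that issues too early sees the old value unless something delays it.

```python
# Toy model only: the latency and register values are made up and do not
# reflect real hardware timings.
LATENCY = 3  # hypothetical number of cycles before a result is written back


def run(stall_cycles):
    """Issue 'R0 = 42' at cycle 0, then 'R1 = R0' after the given stall.

    Returns the value that ends up in R1.
    """
    regs = {"R0": 10, "R1": 0}   # R0 holds an old value of 10
    write_lands_at = 0 + LATENCY  # when the new R0 becomes visible
    read_cycle = 1 + stall_cycles # when the dependent instruction reads R0

    if read_cycle >= write_lands_at:
        regs["R0"] = 42           # the write has landed before the read
    regs["R1"] = regs["R0"]       # dependent instruction reads R0
    regs["R0"] = 42               # the write lands eventually either way
    return regs["R1"]


print(run(stall_cycles=0))  # reads too early, sees the old value
print(run(stall_cycles=2))  # waits long enough, sees the new value
```

In this model the delay is an explicit stall count; a run of do-nothing instructions between the write and the read would have the same effect.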
If you look at the disassembled code you'll notice that instructions are grouped into bundles of three. You might also see that there are hidden instructions between the bundles. The machine code for such an instruction is shown on the right (e.g. /* 0x007fbc0321e01fef */), but it's not disassembled on the left and its address isn't shown, despite it taking up an 8-byte slot like any other instruction. This is actually a scheduling control code. It's not a real instruction; instead it tells the GPU how to schedule the instructions in the bundle that follows it. It tells the GPU things like which instructions need to wait for previous instructions to complete and how long they should wait.
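You can check the slot layout from the addresses in the listing alone. The following Python sketch is just arithmetic on the addresses shown in the question; the assumption that every fourth 8-byte slot holds a control word is inferred from this listing, and the function names are my own.

```python
# Arithmetic on the addresses from the listing in the question.
SLOT_BYTES = 8   # every instruction or control word occupies 8 bytes
GROUP_SLOTS = 4  # assumption: 1 control word + 3 instructions per 32 bytes

# Addresses that the disassembler printed on the left.
listed = [0x28, 0x30, 0x38, 0x48, 0x50, 0x58, 0x68, 0x70, 0x78,
          0x88, 0x90, 0x98, 0xA8, 0xB0, 0xB8]


def missing_slots(addresses):
    """Return the 8-byte slots in the covered range that have no address."""
    present = set(addresses)
    return [a for a in range(min(addresses), max(addresses) + 1, SLOT_BYTES)
            if a not in present]


def is_control_slot(address):
    """True if the address is the first slot of a 32-byte group."""
    return (address // SLOT_BYTES) % GROUP_SLOTS == 0


gaps = missing_slots(listed)
print([hex(a) for a in gaps])
print(all(is_control_slot(a) for a in gaps))
print(any(is_control_slot(a) for a in listed))
```

The gaps fall at 0x40, 0x60, 0x80 and 0xa0, which are exactly the slots where the unlabelled machine words appear in the listing, and none of the labelled instructions sits in such a slot.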
Finally, there's one more possibility, though an extremely unlikely one: that the redundant MOVs aren't actually NOPs at all. They could be acting on yet-to-be-overwritten register values, in parallel with other instructions, in some weird manner that gives them a useful effect other than a delay. However, this would be a very advanced optimization technique that I would only expect in hand-tuned assembly code, not from a compiler that isn't even generating optimized code.