I'm trying to study the SASS
file generated from a very basic CUDA kernel. Here is the kernel:
__global__ void kernel(const float * x,
float * y,
const uint num_rows,
const uint num_cols) {
const uint num_elems = num_rows * num_cols;
const uint tid = blockDim.x * blockIdx.x + threadIdx.x;
for (uint idx = tid; idx < num_elems; idx += blockDim.x * gridDim.x) {
y[idx] = x[idx];
}
}
Here is the SASS
file.
1 00007f26 14f69f00 MOV R1, c[0x0][0x28]
2 00007f26 14f69f10 S2R R0, SR_CTAID.X
3 00007f26 14f69f20 ULDC.64 UR4, c[0x0][0x178]
4 00007f26 14f69f30 UIMAD UR4, UR5, UR4, URZ
5 00007f26 14f69f40 S2R R3, SR_TID.X 3 3840
6 00007f26 14f69f50 IMAD R0, R0, c[0x0][0x0], R3
7 00007f26 14f69f60 ISETP.GE.U32.AND P0, PT, R0, UR4, PT
8 00007f26 14f69f70 @P0 EXIT
9 00007f26 14f69f80 ULDC.64 UR6, c[0x0][0x118]
10 00007f26 14f69f90 MOV R5, 0x4
11 00007f26 14f69fa0 IMAD.WIDE.U32 R2, R0, R5, c[0x0][0x160]
12 00007f26 14f69fb0 LDG.E R3, [R2.64]
13 00007f26 14f69fc0 IMAD.WIDE.U32 R4, R0, R5, c[0x0][0x168]
14 00007f26 14f69fd0 MOV R7, c[0x0][0x0]
15 00007f26 14f69fe0 IMAD R0, R7, c[0x0][0xc], R0
16 00007f26 14f69ff0 ISETP.GE.U32.AND P0, PT, R0, UR4, PT
17 00007f26 14f6a000 STG.E [R4.64], R3
18 00007f26 14f6a010 @!P0 BRA 0x7f2614f69f90
19 00007f26 14f6a020 EXIT
20 00007f26 14f6a030 BRA 0x7f2614f6a030
The Question:
In line one of the SASS
, c[0x0][0x28]
is transferred to R1
, and we don't ever use it. This behavior is not limited to this kernel. I've tested it with several different simple kernels and always see this instruction. Does anyone know what is the purpose of this instruction?
More Information:
c[0x0][xyzw]
(hence, bank 0x0) stores to kernel arguments and launch configurations. However, it is still not clear why there is a seemingly useless move in the first line.I've not found documentation on that either. However, R1
appears to serve as the stack pointer. You can see it in use in code like this:
__global__ void foo(int* inout) {
int tid = threadIdx.x;
volatile int local[12];
local[inout[tid]] = 12;
inout[tid] = local[inout[tid + 1]];
}
foo(int*):
MOV R1, c[0x0][0x20]
IADD32I R1, R1, -0x30
S2R R4, SR_TID.X
SHR R0, R4.reuse, 0x1e
ISCADD R4.CC, R4, c[0x0][0x140], 0x2
IADD.X R5, R0, c[0x0][0x144]
LDG.E R0, [R4]
LDG.E R2, [R4+0x4]
MOV32I R3, 0xc
LEA R0, R0, R1.reuse, 0x2
LEA R2, R2, R1, 0x2
STL [R0], R3
LDL R2, [R2]
STG.E [R4], R2
NOP
EXIT
.L_x_0:
BRA `(.L_x_0)
NOP
.L_x_1:
The location of the initial offset in constant memory seems to change between architectures since here (SM 5.2) it is 0x20.
However, it doesn't appear to be a literal stack pointer since all threads start with the same value. The STL
and LDL
instructions factor in some per-thread offset and scale so that 32 bit accesses are fully coalesced as long as all threads in a warp access the same relative address.
As to why that load is not eliminated, I don't know. Maybe so that the debugger or some other mechanism like machine exception handling always has a valid stack pointer.