Search code examples
sasscuda

Why there is an unused data move in the beginning of CUDA kernel?


I'm trying to study the SASS file generated from a very basic CUDA kernel. Here is the kernel:

__global__ void kernel(const float * x,
                       float * y,
                       const uint num_rows,
                       const uint num_cols) {
    const uint num_elems = num_rows * num_cols;
    const uint tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (uint idx = tid; idx < num_elems; idx += blockDim.x * gridDim.x) {
        y[idx] = x[idx];
    }
}

Here is the SASS file.

1   00007f26 14f69f00         MOV R1, c[0x0][0x28]
2   00007f26 14f69f10         S2R R0, SR_CTAID.X
3   00007f26 14f69f20         ULDC.64 UR4, c[0x0][0x178]
4   00007f26 14f69f30         UIMAD UR4, UR5, UR4, URZ
5   00007f26 14f69f40         S2R R3, SR_TID.X  3   3840
6   00007f26 14f69f50         IMAD R0, R0, c[0x0][0x0], R3
7   00007f26 14f69f60         ISETP.GE.U32.AND P0, PT, R0, UR4, PT
8   00007f26 14f69f70   @P0   EXIT
9   00007f26 14f69f80         ULDC.64 UR6, c[0x0][0x118]
10  00007f26 14f69f90         MOV R5, 0x4
11  00007f26 14f69fa0         IMAD.WIDE.U32 R2, R0, R5, c[0x0][0x160]
12  00007f26 14f69fb0         LDG.E R3, [R2.64]
13  00007f26 14f69fc0         IMAD.WIDE.U32 R4, R0, R5, c[0x0][0x168]
14  00007f26 14f69fd0         MOV R7, c[0x0][0x0]                               
15  00007f26 14f69fe0         IMAD R0, R7, c[0x0][0xc], R0
16  00007f26 14f69ff0         ISETP.GE.U32.AND P0, PT, R0, UR4, PT
17  00007f26 14f6a000         STG.E [R4.64], R3
18  00007f26 14f6a010   @!P0  BRA 0x7f2614f69f90
19  00007f26 14f6a020         EXIT
20  00007f26 14f6a030         BRA 0x7f2614f6a030                            

The Question:

In line one of the SASS, c[0x0][0x28] is transferred to R1, and we don't ever use it. This behavior is not limited to this kernel. I've tested it with several different simple kernels and always see this instruction. Does anyone know what is the purpose of this instruction?

More Information:

  1. We know c[0x0][xyzw] (hence, bank 0x0) stores to kernel arguments and launch configurations. However, it is still not clear why there is a seemingly useless move in the first line.
  2. This is not a small part of a large file. What you see is what I compile.

Solution

  • I've not found documentation on that either. However, R1 appears to serve as the stack pointer. You can see it in use in code like this:

    __global__ void foo(int* inout) {
        int tid = threadIdx.x;
        volatile int local[12];
        local[inout[tid]] = 12;
        inout[tid] = local[inout[tid + 1]];
    }
    
    foo(int*):
     MOV R1, c[0x0][0x20] 
     IADD32I R1, R1, -0x30 
     S2R R4, SR_TID.X         
     SHR R0, R4.reuse, 0x1e 
     ISCADD R4.CC, R4, c[0x0][0x140], 0x2 
     IADD.X R5, R0, c[0x0][0x144] 
     LDG.E R0, [R4] 
     LDG.E R2, [R4+0x4] 
     MOV32I R3, 0xc 
     LEA R0, R0, R1.reuse, 0x2 
     LEA R2, R2, R1, 0x2 
     STL [R0], R3         
     LDL R2, [R2] 
     STG.E [R4], R2 
     NOP 
     EXIT 
    .L_x_0:
     BRA `(.L_x_0) 
     NOP
    .L_x_1:
    

    The location of the initial offset in constant memory seems to change between architectures since here (SM 5.2) it is 0x20.

    However, it doesn't appear to be a literal stack pointer since all threads start with the same value. The STL and LDL instructions factor in some per-thread offset and scale so that 32 bit accesses are fully coalesced as long as all threads in a warp access the same relative address.

    As to why that load is not eliminated, I don't know. Maybe so that the debugger or some other mechanism like machine exception handling always has a valid stack pointer.