Search code examples
sasscudaprofilingnvidiansight

Difference in SASS using cuobjdump and Nsight compute


I have a simple kernel as

__global__ void hello_cuda() {
    int a = 10;
    printf("hello from GPU\n");
}

When I use Nsight compute to see the Source and SASS section, I see:

#   Address                   Source
1   00007fe8 82f9ca00        IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] 
2   00007fe8 82f9ca10        UMOV UR4, 0x0 
3   00007fe8 82f9ca20        CS2R R6, SRZ 
4   00007fe8 82f9ca30        UMOV UR5, 0x0 
5   00007fe8 82f9ca40        MOV R4, UR4 
6   00007fe8 82f9ca50        IMAD.U32 R5, RZ, RZ, UR5 
7   00007fe8 82f9ca60        MOV R20, 0x0 
8   00007fe8 82f9ca70        MOV R21, 0x0 
9   00007fe8 82f9ca80        CALL.ABS.NOINC 0x0 
10  00007fe8 82f9ca90        EXIT 
11  00007fe8 82f9caa0        BRA 0x7fe882f9caa0
12  00007fe8 82f9cab0        NOP
13  00007fe8 82f9cac0        NOP
14  00007fe8 82f9cad0        NOP
15  00007fe8 82f9cae0        NOP
16  00007fe8 82f9caf0        NOP

But when I do cuobjdump -sass saas_source_mapping_trial, I get

    code for sm_52
        Function : _Z10hello_cudav
    .headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                           /* 0x001fc400fe2007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;  /* 0x4c98078000870001 */
        /*0010*/                   MOV32I R4, 0x0 ;        /* 0x010000000007f004 */
        /*0018*/                   MOV32I R5, 0x0 ;        /* 0x010000000007f005 */
                                                           /* 0x001ff400fec007f1 */
        /*0028*/                   MOV R6, RZ ;            /* 0x5c9807800ff70006 */
        /*0030*/                   MOV R7, RZ ;            /* 0x5c9807800ff70007 */
        /*0038*/                   JCAL 0x0 ;              /* 0xe220000000000040 */
                                                           /* 0x001ffc00fda007ef */
        /*0048*/                   NOP ;                   /* 0x50b0000000070f00 */
        /*0050*/                   NOP ;                   /* 0x50b0000000070f00 */
        /*0058*/                   EXIT ;                  /* 0xe30000000007000f */
                                                           /* 0x001f8000fc0007ff */
        /*0068*/                   BRA 0x60 ;              /* 0xe2400fffff07000f */
        /*0070*/                   NOP;                    /* 0x50b0000000070f00 */
        /*0078*/                   NOP;                    /* 0x50b0000000070f00 */
  1. Why is there a difference between my SASS and the NCU one?
  2. How did NCU get the address? The address from my cuobjdump is not the same at all.

Solution

  • Why is there a difference between my SASS and the NCU one?

    Unless you are running on a cc5.2 device, the nsight compute won't match the sass obtained with cuobjdump. cuobjdump shows you the compiled code in the binary. But if you have compiled with PTX included, and are running on a different architecture (than cc5.2 in this case) then the JIT mechanism will recompile your device code on the fly. nsight compute will "see" that recompiled device code. So the first step in harmonization/sanity in this situation is to compile for the device architecture you are running on, and make sure you are comparing SASS that is intended for the same architecture

    (I'm fairly confident you are not comparing apples to apples here. The cuobjdump SASS indicates it was compiled for sm_52, a maxwell architecture, whereas the ncu SASS includes a UMOV instruction, which targets the uniform datapath on a turing or newer GPU.)

    How did NCU get the address? The address from my cuobjdump is not the same at all.

    this is typical for runtime loaders in many architectures. The executable object (i.e. kernel) is placed into memory during the runtime loader process. That will affect the addresses displayed. When you are doing static analysis with cuobjdump, the "addresses" you see are relative addresses to the "base". Once the code is loaded into a particular place in memory, the addresses are different. They don't start at zero as the relative addresses do.

    And, again, since the actual instructions are different, some of the offsets may be different. First step towards sanity is to compile for the architecture you are running on, then use cuobjdump.