Tags: cuda, gpu, nvidia, gpgpu, ptx

How to get instruction cost on an NVIDIA GPU?


I want to know how many clock cycles instructions such as add, mul, and ld/st cost on an NVIDIA GPU. How can I measure this?

I wrote some code to test this and ran it on a 2080 Ti:

    const int test_cnt = 1000;

        auto lastT = clock64();
        uint32_t res;
#pragma unroll
        for (int i = 0; i<test_cnt; ++i) {
            asm volatile("mul.lo.u32 %0, %0, %1;"
                    : "+r"(res)
                    : "r"(i));
            asm volatile("mul.hi.u32 %0, %0, %1;"
                    : "+r"(res)
                    : "r"(i));
        }
        printf("in gpu phase 1 :%lld %u\n", clock64() - lastT, res);


But the result confused me. The output is:

in gpu phase 1 :6 0

With so many mul instructions, why is the clock cost just 6? Is there some optimization in the nvcc compiler?

I ran the command cuobjdump --dump-ptx ./cutest

and got the following PTX instructions:

        mov.u64 %rd2, %clock64;
        mov.u32 %r38002, 0;
        mul.lo.u32 %r7, %r7, %r38002;
        mul.hi.u32 %r7, %r7, %r38002;
        mov.u32 %r38005, 1;
        mul.lo.u32 %r7, %r7, %r38005;
        mul.hi.u32 %r7, %r7, %r38005;
        mov.u32 %r38008, 2;
        mul.lo.u32 %r7, %r7, %r38008;
        mul.hi.u32 %r7, %r7, %r38008;
        mov.u32 %r38011, 3;
        mul.lo.u32 %r7, %r7, %r38011;
        mul.hi.u32 %r7, %r7, %r38011;
        mov.u32 %r38014, 4;
        mul.lo.u32 %r7, %r7, %r38014;
        mul.hi.u32 %r7, %r7, %r38014;
        mov.u32 %r38017, 5;
        mul.lo.u32 %r7, %r7, %r38017;

        ...

The PTX above shows that all the mul instructions are present; nothing looks optimized away. So why is the reported clock cost so small?

Also, is there another way to get instruction cost on an NVIDIA GPU? Is there any document that specifies these details?


Solution

  • Probably the most important takeaway here is do not use PTX for this kind of analysis.

    When I compile the code you have shown, the SASS code (what the GPU actually executes) doesn't have much resemblance to the PTX code you have shown:

    $ cat t2180.cu
    #include <cstdio>
    #include <cstdint>
    __global__ void k(){
    
        const int test_cnt = 1000;
    
            auto lastT = clock64();
            uint32_t res;
    #pragma unroll
            for (int i = 0; i<test_cnt; ++i) {
                asm volatile("mul.lo.u32 %0, %0, %1;"
                        : "+r"(res)
                        : "r"(i));
                asm volatile("mul.hi.u32 %0, %0, %1;"
                        : "+r"(res)
                        : "r"(i));
            }
            printf("in gpu phase 1 :%lu %u\n", clock64() - lastT, res);
    }
    
    int main(){
    
      k<<<1,1>>>();
      cudaDeviceSynchronize();
    }
    $ nvcc -o t2180 t2180.cu -arch=sm_75
    $ cuobjdump -ptx ./t2180
    
    Fatbin elf code:
    ================
    arch = sm_75
    code version = [1,7]
    producer = <unknown>
    host = linux
    compile_size = 64bit
    
    Fatbin elf code:
    ================
    arch = sm_75
    code version = [1,7]
    producer = <unknown>
    host = linux
    compile_size = 64bit
    
    Fatbin ptx code:
    ================
    arch = sm_75
    code version = [7,4]
    producer = <unknown>
    host = linux
    compile_size = 64bit
    compressed
    
    
    
    
    
    
    
    
    .version 7.4
    .target sm_75
    .address_size 64
    
    
    .extern .func (.param .b32 func_retval0) vprintf
    (
    .param .b64 vprintf_param_0,
    .param .b64 vprintf_param_1
    )
    ;
    .global .align 1 .b8 $str[24] = {105, 110, 32, 103, 112, 117, 32, 112, 104, 97, 115, 101, 32, 49, 32, 58, 37, 108, 117, 32, 37, 117, 10, 0};
    
    .visible .entry _Z1kv()
    {
    .local .align 8 .b8 __local_depot0[16];
    .reg .b64 %SP;
    .reg .b64 %SPL;
    .reg .b32 %r<6002>;
    .reg .b64 %rd<8>;
    
    
    mov.u64 %SPL, __local_depot0;
    cvta.local.u64 %SP, %SPL;
    
            mov.u64 %rd1, %clock64;
    
            mov.u32 %r5, 0;
    
            mul.lo.u32 %r7, %r7, %r5;
    
    
            mul.hi.u32 %r7, %r7, %r5;
    
            mov.u32 %r11, 1;
    
            mul.lo.u32 %r7, %r7, %r11;
    
    
            mul.hi.u32 %r7, %r7, %r11;
    
            mov.u32 %r17, 2;
    
            mul.lo.u32 %r7, %r7, %r17;
    
    
            mul.hi.u32 %r7, %r7, %r17;
    
            mov.u32 %r23, 3;
    
            < repeats many times >
    
    
            mul.lo.u32 %r7, %r7, %r5957;
    
    
            mul.hi.u32 %r7, %r7, %r5957;
    
            mov.u32 %r5963, 993;
    
            mul.lo.u32 %r7, %r7, %r5963;
    
    
            mul.hi.u32 %r7, %r7, %r5963;
    
            mov.u32 %r5969, 994;
    
            < repeats many times >
    
            mul.hi.u32 %r7, %r7, %r5999;
    
    
            mov.u64 %rd2, %clock64;
    
            sub.s64 %rd3, %rd2, %rd1;
    add.u64 %rd4, %SP, 0;
    add.u64 %rd5, %SPL, 0;
    st.local.u64 [%rd5], %rd3;
    st.local.u32 [%rd5+8], %r7;
    mov.u64 %rd6, $str;
    cvta.global.u64 %rd7, %rd6;
    {
            .reg .b32 temp_param_reg;
    .param .b64 param0;
    st.param.b64 [param0+0], %rd7;
    .param .b64 param1;
    st.param.b64 [param1+0], %rd4;
    .param .b32 retval0;
    call.uni (retval0),
    vprintf,
    (
    param0,
    param1
    );
    ld.param.b32 %r6001, [retval0+0];
    }
            ret;
    
    }
    
    $ cuobjdump -sass ./t2180
    
    Fatbin elf code:
    ================
    arch = sm_75
    code version = [1,7]
    producer = <unknown>
    host = linux
    compile_size = 64bit
    
            code for sm_75
    
    Fatbin elf code:
    ================
    arch = sm_75
    code version = [1,7]
    producer = <unknown>
    host = linux
    compile_size = 64bit
    
            code for sm_75
                    Function : _Z1kv
            .headerflags    @"EF_CUDA_SM75 EF_CUDA_PTX_SM(EF_CUDA_SM75)"
            /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;  /* 0x00000a00ff017624 */
                                                                                /* 0x000fca00078e00ff */
            /*0010*/                   IADD3 R1, R1, -0x10, RZ ;                /* 0xfffffff001017810 */
                                                                                /* 0x000fc80007ffe0ff */
            /*0020*/                   IADD3 R6, P0, R1, c[0x0][0x20], RZ ;     /* 0x0000080001067a10 */
                                                                                /* 0x000fca0007f1e0ff */
            /*0030*/                   IMAD.X R7, RZ, RZ, c[0x0][0x24], P0 ;    /* 0x00000900ff077624 */
                                                                                /* 0x000fcc00000e06ff */
            /*0040*/                   CS2R R2, SR_CLOCKLO ;                    /* 0x0000000000027805 */
                                                                                /* 0x000fc40000015000 */
            /*0050*/                   CS2R R4, SR_CLOCKLO ;                    /* 0x0000000000047805 */
                                                                                /* 0x000fcc0000015000 */
            /*0060*/                   IADD3 R2, P0, -R2, R4, RZ ;              /* 0x0000000402027210 */
                                                                                /* 0x000fe20007f1e1ff */
            /*0070*/                   STL [R1+0x8], RZ ;                       /* 0x000008ff01007387 */
                                                                                /* 0x0001e20000100800 */
            /*0080*/                   UMOV UR4, 0x0 ;                          /* 0x0000000000047882 */
                                                                                /* 0x000fe40000000000 */
            /*0090*/                   UMOV UR5, 0x0 ;                          /* 0x0000000000057882 */
                                                                                /* 0x000fe20000000000 */
            /*00a0*/                   IMAD.X R3, R5, 0x1, ~R3, P0 ;            /* 0x0000000105037824 */
                                                                                /* 0x000fe400000e0e03 */
            /*00b0*/                   IMAD.U32 R4, RZ, RZ, UR4 ;               /* 0x00000004ff047e24 */
                                                                                /* 0x000fe4000f8e00ff */
            /*00c0*/                   IMAD.U32 R5, RZ, RZ, UR5 ;               /* 0x00000005ff057e24 */
                                                                                /* 0x000fc8000f8e00ff */
            /*00d0*/                   STL.64 [R1], R2 ;                        /* 0x0000000201007387 */
                                                                                /* 0x0001e80000100a00 */
            /*00e0*/                   MOV R20, 0x0 ;                           /* 0x0000000000147802 */
                                                                                /* 0x000fe40000000f00 */
            /*00f0*/                   MOV R21, 0x0 ;                           /* 0x0000000000157802 */
                                                                                /* 0x000fcc0000000f00 */
            /*0100*/                   CALL.ABS.NOINC 0x0 ;                     /* 0x0000000000007943 */
                                                                                /* 0x001fea0003c00000 */
            /*0110*/                   EXIT ;                                   /* 0x000000000000794d */
                                                                                /* 0x000fea0003800000 */
            /*0120*/                   BRA 0x120;                               /* 0xfffffff000007947 */
                                                                                /* 0x000fc0000383ffff */
            /*0130*/                   NOP;                                     /* 0x0000000000007918 */
                                                                                /* 0x000fc00000000000 */
            /*0140*/                   NOP;                                     /* 0x0000000000007918 */
                                                                                /* 0x000fc00000000000 */
            /*0150*/                   NOP;                                     /* 0x0000000000007918 */
                                                                                /* 0x000fc00000000000 */
            /*0160*/                   NOP;                                     /* 0x0000000000007918 */
                                                                                /* 0x000fc00000000000 */
            /*0170*/                   NOP;                                     /* 0x0000000000007918 */
                                                                                /* 0x000fc00000000000 */
                    ..........
    
    
    
    Fatbin ptx code:
    ================
    arch = sm_75
    code version = [7,4]
    producer = <unknown>
    host = linux
    compile_size = 64bit
    compressed
    $
    

    The SASS code shows no evidence of your loop, nor any unrolling.

    Is there some optimization in the nvcc compiler?

    Yes, the tool that converts PTX to SASS (ptxas) is an optimizing compiler. You now have an example of this above.

    With so many mul instructions, why is the clock cost just 6?

    The biggest reason is that the code has been optimized to remove the loop entirely. In the SASS above, the two CS2R SR_CLOCKLO (clock read) instructions are back-to-back, so the 6 cycles you measured is essentially just the cost of reading the clock register twice.

    Is there another way to get instruction cost on an NVIDIA GPU? Is there any document that specifies these details?

    For the most part, NVIDIA doesn't publish anything like that.

    People who are interested in these things usually end up writing microbenchmarking codes, something like the one you wrote. Some notable example reports are published by the Citadel group; here is one. That one covers the T4 GPU which, for instruction latencies, should be similar to your 2080 Ti (both are Turing-class).
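    The usual pattern in such reports, sketched here under assumptions (the kernel name mul_lat is hypothetical, and a Turing-class GPU is assumed), is to time a long chain of *dependent* instructions, so each one must wait for the previous result and latencies cannot overlap, and to store the final value so the compiler cannot delete the chain:

        #include <cstdio>
        #include <cstdint>

        __global__ void mul_lat(uint32_t seed, uint32_t *sink, long long *cycles) {
            uint32_t x = seed;
            long long start = clock64();
        #pragma unroll
            for (int i = 0; i < 256; ++i) {
                // each mul reads the x written by the previous one, forming a
                // dependency chain whose time is dominated by instruction latency
                asm volatile("mul.lo.u32 %0, %0, %0;" : "+r"(x));
            }
            long long stop = clock64();
            *sink = x;              // observable result: ptxas must keep the chain
            *cycles = stop - start; // still includes the clock-read overhead
        }

        int main() {
            uint32_t *sink; long long *cycles;
            cudaMallocManaged(&sink, sizeof(*sink));
            cudaMallocManaged(&cycles, sizeof(*cycles));
            mul_lat<<<1,1>>>(3u, sink, cycles);
            cudaDeviceSynchronize();
            printf("~%.1f cycles per dependent mul\n", (double)*cycles / 256);
        }

    Dividing by the chain length is only approximate; check the SASS with cuobjdump -sass to confirm the chain survived and no loop remains.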


    How can I measure this?

    I made a simple change to your code that "breaks" the compiler's ability to optimize:

    $ cat t2180.cu
    #include <cstdio>
    #include <cstdint>
    __global__ void k(int j){
    
        const int test_cnt = 1000;
    
            auto lastT = clock64();
            uint32_t res = 1;
    #pragma unroll
            for (int i = j; i<test_cnt; ++i) {
                asm volatile("mul.lo.u32 %0, %0, %1;"
                        : "+r"(res)
                        : "r"(i));
                asm volatile("mul.hi.u32 %0, %0, %1;"
                        : "+r"(res)
                        : "r"(i));
            }
            printf("in gpu phase 1 :%lu %u\n", clock64() - lastT, res);
    }
    
    int main(){
    
      k<<<1,1>>>(1);
      cudaDeviceSynchronize();
    }
    $ nvcc -o t2180 t2180.cu -arch=sm_75
    $ ./t2180
    in gpu phase 1 :14331 0
    $
    

    You now know how to compare the PTX and SASS. If you study the SASS for the above case, you will observe the existence of a loop in the SASS code, consistent with the loop in your source code. You can also turn the measurement into a rough per-instruction estimate: the loop runs 999 times and issues 2 dependent mul instructions per iteration, so 14331 / 1998 ≈ 7 cycles each, an upper bound that also includes loop overhead.

    As an aside, your initial code did arithmetic on, and printed results from, an uninitialized variable:

        uint32_t res;
    

    AFAIK that invokes UB in C++. My general understanding is that if your code contains UB, the results may be unpredictable or confusing. My general understanding is that a compiler may make "unexpected" optimizations in the presence of UB, although I'm not stating that is happening in this case. So my suggestion is to make sure your code is not invoking UB before you start microbenchmarking.