Search code examples
cudagpunvidianvprof

CUDA logarithm: nvprof detects single precision operations in double precision


I'm computing "log(x)" in double precision in CUDA, but when I profile, it detects single precision operations using metric "flop_count_sp_special".

I'm compiling with "-arch=sm_30" to ensure compute capability 3.0 and double precision arithmetic, but I can't seem to find a way to ensure special functions are computed in double precision. Is this possible?


Solution

  • It appears that the CUDA calculation of double-precision floating-point log does involve (a small number of) single precision floating point calculations. I think this need not be cause for concern, in and of itself.

    The GPU hardware does not support a native double-precision log calculation, therefore the compiler, upon encountering this, replaces the instruction with a sequence of operations (e.g. add, multiply, etc.) designed to perform the calculation. We can confirm this with a simple example:

    $ cat t288.cu
    #include <math.h>
    #include <stdio.h>
    __global__ void kernel(double *y)
    {
      double x = *y;
      *y = log(x);
    }
    
    int main(){
    
      double *d_y, h_y = 10.0;
      cudaMalloc(&d_y, sizeof(double));
      cudaMemcpy(d_y, &h_y, sizeof(double), cudaMemcpyHostToDevice);
      kernel<<<1,1>>>(d_y);
      cudaDeviceSynchronize();
      cudaMemcpy(&h_y, d_y, sizeof(double), cudaMemcpyDeviceToHost);
      printf("val = %f\n", h_y);
    }
    
    $ nvcc -arch=sm_35 -o t288 t288.cu
    $ cuda-memcheck ./t288
    ========= CUDA-MEMCHECK
    val = 2.302585
    ========= ERROR SUMMARY: 0 errors
    $ CUDA_VISIBLE_DEVICES="1" nvprof --metrics flop_count_dp,flop_count_sp,flop_count_sp_special ./t288
    ==14909== NVPROF is profiling process 14909, command: ./t288
    ==14909== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
    Replaying kernel "kernel(double*)" (done)
    val = 2.302585rnal events
    ==14909== Profiling application: ./t288
    ==14909== Profiling result:
    ==14909== Metric result:
    Invocations                               Metric Name                                    Metric Description         Min         Max         Avg
    Device "Tesla K40m (0)"
        Kernel: kernel(double*)
              1                             flop_count_dp           Floating Point Operations(Double Precision)          44          44          44
              1                             flop_count_sp           Floating Point Operations(Single Precision)           0           0           0
              1                     flop_count_sp_special   Floating Point Operations(Single Precision Special)           1           1           1
    $ cuobjdump -sass ./t288
    
    Fatbin elf code:
    ================
    arch = sm_35
    code version = [1,7]
    producer = <unknown>
    host = linux
    compile_size = 64bit
    
            code for sm_35
    
    Fatbin elf code:
    ================
    arch = sm_35
    code version = [1,7]
    producer = cuda
    host = linux
    compile_size = 64bit
    
            code for sm_35
                    Function : _Z6kernelPd
            .headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                                                      /* 0x088010fc10a010ac */
            /*0008*/                   MOV R1, c[0x0][0x44];                          /* 0x64c03c00089c0006 */
            /*0010*/                   MOV R2, c[0x0][0x140];                         /* 0x64c03c00281c000a */
            /*0018*/                   MOV R3, c[0x0][0x144];                         /* 0x64c03c00289c000e */
            /*0020*/                   LD.E.64 R12, [R2];                             /* 0xc5800000001c0830 */
            /*0028*/                   MOV32I R0, 0xfffffc01;                         /* 0x747ffffe009fc002 */
            /*0030*/                   ISETP.GT.AND P0, PT, R13, c[0x2][0x0], PT;     /* 0x5b481c40001c341e */
            /*0038*/                   MOV R4, R12;                                   /* 0xe4c03c00061c0012 */
                                                                                      /* 0x08ac1080a0a0a4ac */
            /*0048*/                   MOV R5, R13;                                   /* 0xe4c03c00069c0016 */
            /*0050*/              @!P0 DMUL R4, R4, 1.80143985094819840000e+16;       /* 0xc400021a80201011 */
            /*0058*/              @!P0 MOV R13, R5;                                   /* 0xe4c03c0002a00036 */
            /*0060*/                   IADD32I R6, R13, -0x1;                         /* 0x407fffffff9c3419 */
            /*0068*/                   ISETP.LT.U32.AND P1, PT, R6, c[0x2][0x4], PT;  /* 0x5b101c40009c183e */
            /*0070*/              @!P0 MOV R12, R4;                                   /* 0xe4c03c0002200032 */
            /*0078*/              @!P0 MOV32I R0, 0xfffffbcb;                         /* 0x747ffffde5a3c002 */
                                                                                      /* 0x08b88010a4a010ac */
            /*0088*/               @P1 BRA 0xc0;                                      /* 0x120000001804003c */
            /*0090*/                   MOV32I R8, 0x0;                                /* 0x74000000001fc022 */
            /*0098*/                   MOV32I R9, 0x7ff00000;                         /* 0x743ff800001fc026 */
            /*00a0*/                   DFMA R8, R4, +INF , R8;                        /* 0xb38023ff801c1021 */
            /*00a8*/                   FCMP.NEU R12, R8, RZ, R5;                      /* 0xdd6814007f9c2032 */
            /*00b0*/                   FCMP.NEU R13, R9, -QNAN , R5;                  /* 0xbd6817ff801c2435 */
            /*00b8*/                   BRA 0x240;                                     /* 0x12000000c01c003c */
                                                                                      /* 0x08a010b010a0a010 */
            /*00c8*/                   LOP32I.AND R4, R13, 0x800fffff;                /* 0x204007ffff9c3410 */
            /*00d0*/                   IMAD.U32.U32.HI R0, R13, 0x1000, R0;           /* 0xa2000008001c3401 */
            /*00d8*/                   LOP32I.OR R5, R4, 0x3ff00000;                  /* 0x211ff800001c1014 */
            /*00e0*/                   ISETP.LT.AND P0, PT, R5, c[0x2][0x8], PT;      /* 0x5b181c40011c141e */
            /*00e8*/                   MOV R4, R12;                                   /* 0xe4c03c00061c0012 */
            /*00f0*/              @!P0 IADD32I R7, R5, -0x100000;                     /* 0x407ff8000020141d */
            /*00f8*/              @!P0 IADD32I R0, R0, 0x1;                           /* 0x4000000000a00001 */
                                                                                      /* 0x08a010a01080a010 */
            /*0108*/              @!P0 MOV R5, R7;                                    /* 0xe4c03c0003a00016 */
            /*0110*/                   LOP32I.XOR R12, R0, 0x80000000;                /* 0x22400000001c0030 */
            /*0118*/                   DADD R6, R4, 1;                                /* 0xc38001ff801c1019 */
            /*0120*/                   MOV R8, RZ;                                    /* 0xe4c03c007f9c0022 */
            /*0128*/                   DADD R4, R4, -1;                               /* 0xcb8001ff801c1011 */
            /*0130*/                   MUFU.RCP64H R9, R7;                            /* 0x84000000031c1c26 */
            /*0138*/                   MOV32I R14, 0x8b7a8b04;                        /* 0x7445bd45821fc03a */
                                                                                      /* 0x08a01080a4a4a4a4 */
            /*0148*/                   DFMA R6, -R6, R8, c[0x2][0x10];                /* 0x9b882040021c181a */
            /*0150*/                   DFMA R6, R6, R6, R6;                           /* 0xdb801800031c181a */
            /*0158*/                   DFMA R6, R8, R6, R8;                           /* 0xdb802000031c201a */
            /*0160*/                   DMUL R8, R6, R4;                               /* 0xe4000000021c1822 */
            /*0168*/                   DFMA R8, R6, R4, R8;                           /* 0xdb802000021c1822 */
            /*0170*/                   MOV32I R15, 0x3ed0ee25;                        /* 0x741f6877129fc03e */
            /*0178*/                   MOV32I R13, 0x43300000;                        /* 0x74219800001fc036 */
                                                                                      /* 0x08a080a080a4a4a4 */
            /*0188*/                   DMUL R10, R8, R8;                              /* 0xe4000000041c202a */
            /*0190*/                   DFMA R14, R10, c[0x2][0x18], R14;              /* 0x5b803840031c283a */
            /*0198*/                   DFMA R14, R10, R14, c[0x2][0x20];              /* 0x9b803840041c283a */
            /*01a0*/                   DFMA R14, R10, R14, c[0x2][0x28];              /* 0x9b803840051c283a */
            /*01a8*/                   DADD R16, R4, -R8;                             /* 0xe3810000041c1042 */
            /*01b0*/                   DFMA R14, R10, R14, c[0x2][0x30];              /* 0x9b803840061c283a */
            /*01b8*/                   DADD R18, R16, R16;                            /* 0xe3800000081c404a */
                                                                                      /* 0x088880948880a080 */
            /*01c8*/                   DFMA R14, R10, R14, c[0x2][0x38];              /* 0x9b803840071c283a */
            /*01d0*/                   DADD R12, R12, c[0x2][0x50];                   /* 0x638000400a1c3032 */
            /*01d8*/                   DFMA R16, R10, R14, c[0x2][0x40];              /* 0x9b803840081c2842 */
            /*01e0*/                   DFMA R14, R12, c[0x2][0x58], R8;               /* 0x5b8020400b1c303a */
            /*01e8*/                   DFMA R4, -R4, R8, R18;                         /* 0xdb884800041c1012 */
            /*01f0*/                   DFMA R16, R10, R16, c[0x2][0x48];              /* 0x9b804040091c2842 */
            /*01f8*/                   DFMA R18, -R12, c[0x2][0x58], R14;             /* 0x5b8838400b1c304a */
                                                                                      /* 0x08aca4a4a4a08094 */
            /*0208*/                   DMUL R4, R6, R4;                               /* 0xe4000000021c1812 */
            /*0210*/                   DMUL R10, R10, R16;                            /* 0xe4000000081c282a */
            /*0218*/                   DADD R18, -R8, R18;                            /* 0xe3880000091c204a */
            /*0220*/                   DFMA R4, R8, R10, R4;                          /* 0xdb801000051c2012 */
            /*0228*/                   DADD R18, R4, -R18;                            /* 0xe3810000091c104a */
            /*0230*/                   DFMA R12, R12, c[0x2][0x60], R18;              /* 0x5b8048400c1c3032 */
            /*0238*/                   DADD R12, R14, R12;                            /* 0xe3800000061c3832 */
                                                                                      /* 0x080000000000b810 */
            /*0248*/                   ST.E.64 [R2], R12;                             /* 0xe5800000001c0830 */
            /*0250*/                   EXIT;                                          /* 0x18000000001c003c */
            /*0258*/                   BRA 0x258;                                     /* 0x12007ffffc1c003c */
            /*0260*/                   NOP;                                           /* 0x85800000001c3c02 */
            /*0268*/                   NOP;                                           /* 0x85800000001c3c02 */
            /*0270*/                   NOP;                                           /* 0x85800000001c3c02 */
            /*0278*/                   NOP;                                           /* 0x85800000001c3c02 */
                    ......................
    
    
    
    Fatbin ptx code:
    ================
    arch = sm_35
    code version = [6,2]
    producer = cuda
    host = linux
    compile_size = 64bit
    compressed
    $
    

    First of all we see that the profiler reports this double precision log op using mostly double-precision calculation but also one single-precision "special" op.

    Looking at the SASS dump, we observe this single-precision instruction:

            /*0130*/                   MUFU.RCP64H R9, R7;                            /* 0x84000000031c1c26 */
    

    (we can confirm these are single-precision by referring to the documentation)

    Most of the algorithm appears to use double-precision floating point. This particular step may be involved in some estimation that only requires single-precision.

    So it would appear to be "expected" that use of log may report some (nonzero) flop_count_sp_special metric.

    (Yes, the FCMP instructions are also single-precision, but they appear to be in a code path not taken, for this case.)