Search code examples
cudaloop-unrolling

CUDA loop unrolling on triangular region


Is it possible to unroll a loop on a triangular region, such as:

for (int i = 0; i < ROW_LENGTH; i++)
{
    for (int j = 0; j < i; j++)
    {
        // Some array operation here
    }
}

where ROW_LENGTH is a constant defined at compile time? As it stands now, I don't think this is possible because i is changing as the program executes (and more importantly, it's not a constant at compile time). I suppose you could treat the 2D array as a 1D array, iterate from 0 to (ROW_LENGTH^2)/2, and then try a couple math tricks to get the indices, but the extra operations defeat the purpose of the loop unrolling in the first place.


Solution

  • The CUDA 7.0 compiler will unroll this in my test. The loop indices are all known at compile time so there's no reason why it shouldn't be able to.

    Consider the following code, which sets a triangular portion of a to be 1.

    #define ROW_LENGTH 4
    __global__ void triUnrollTest1(float* a) {
       #pragma unroll
       for (int i = 0; i < ROW_LENGTH; i++)
       {
          #pragma unroll
          for (int j = 0; j < i; j++)
          {
             a[i * ROW_LENGTH + j] = 1.f;
          }
       }
    }
    

    As ROW_LENGTH only 4 we can unroll this ourselves:

    __global__ void triUnrollTest2(float* a) {
       a[1 * ROW_LENGTH + 0] = 1.f;
       a[2 * ROW_LENGTH + 0] = 1.f;
       a[2 * ROW_LENGTH + 1] = 1.f;
       a[3 * ROW_LENGTH + 0] = 1.f;
       a[3 * ROW_LENGTH + 1] = 1.f;
       a[3 * ROW_LENGTH + 2] = 1.f;
    }
    

    Compiling for SM 35 using CUDA 7.0: nvcc -arch=sm_35 -c triUnroll.cu

    Then dumping the SASS assembler: cuobjdump --dump-sass triUnroll.o

    We get:

    code for sm_35
            Function : _Z14triUnrollTest1Pf
    .headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                              /* 0x08b8b8a0b010a000 */
    /*0008*/                   MOV R1, c[0x0][0x44];          /* 0x64c03c00089c0006 */
    /*0010*/                   MOV R0, c[0x0][0x140];         /* 0x64c03c00281c0002 */
    /*0018*/                   IADD R2.CC, R0, 0x10;          /* 0xc0840000081c0009 */
    /*0020*/                   MOV32I R0, 0x3f800000;         /* 0x741fc000001fc002 */
    /*0028*/                   IADD.X R3, RZ, c[0x0][0x144];  /* 0x60804000289ffc0e */
    /*0030*/                   ST.E [R2], R0;                 /* 0xe4800000001c0800 */
    /*0038*/                   ST.E [R2+0x10], R0;            /* 0xe4800000081c0800 */
                                                              /* 0x080000b810b8b8b8 */
    /*0048*/                   ST.E [R2+0x14], R0;            /* 0xe48000000a1c0800 */
    /*0050*/                   ST.E [R2+0x20], R0;            /* 0xe4800000101c0800 */
    /*0058*/                   ST.E [R2+0x24], R0;            /* 0xe4800000121c0800 */
    /*0060*/                   ST.E [R2+0x28], R0;            /* 0xe4800000141c0800 */
    /*0068*/                   EXIT;                          /* 0x18000000001c003c */
    /*0070*/                   BRA 0x70;                      /* 0x12007ffffc1c003c */
    /*0078*/                   NOP;                           /* 0x85800000001c3c02 */
            .....................................
    
    
            Function : _Z14triUnrollTest2Pf
    .headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                              /* 0x08b8b8a0b010a000 */
    /*0008*/                   MOV R1, c[0x0][0x44];          /* 0x64c03c00089c0006 */
    /*0010*/                   MOV R0, c[0x0][0x140];         /* 0x64c03c00281c0002 */
    /*0018*/                   IADD R2.CC, R0, 0x10;          /* 0xc0840000081c0009 */
    /*0020*/                   MOV32I R0, 0x3f800000;         /* 0x741fc000001fc002 */
    /*0028*/                   IADD.X R3, RZ, c[0x0][0x144];  /* 0x60804000289ffc0e */
    /*0030*/                   ST.E [R2], R0;                 /* 0xe4800000001c0800 */
    /*0038*/                   ST.E [R2+0x10], R0;            /* 0xe4800000081c0800 */
                                                              /* 0x080000b810b8b8b8 */
    /*0048*/                   ST.E [R2+0x14], R0;            /* 0xe48000000a1c0800 */
    /*0050*/                   ST.E [R2+0x20], R0;            /* 0xe4800000101c0800 */
    /*0058*/                   ST.E [R2+0x24], R0;            /* 0xe4800000121c0800 */
    /*0060*/                   ST.E [R2+0x28], R0;            /* 0xe4800000141c0800 */
    /*0068*/                   EXIT;                          /* 0x18000000001c003c */
    /*0070*/                   BRA 0x70;                      /* 0x12007ffffc1c003c */
    /*0078*/                   NOP;                           /* 0x85800000001c3c02 */
            .....................................
    

    Obviously both are the same and nicely unrolled. Interestingly when I accidentally compiled with 6.5 for my first answer the compiler did not unroll, so I guess it pays to be up to date in this case!