cuda, cpu-registers, restrict-qualifier

Can a const * __restrict__ increase CUDA register usage?


Because all my pointers point to non-overlapping memory, I went all out and changed the pointers passed to kernels (and to their inlined functions) to be restricted, and made them const too wherever possible. However, this increased the register usage of some kernels and decreased it for others. This doesn't make much sense to me.

Does anybody know why this can be the case?


Solution

  • Yes, it can increase register usage.

    Referring to the programming guide for __restrict__:

    The effects here are a reduced number of memory accesses and reduced number of computations. This is balanced by an increase in register pressure due to "cached" loads and common sub-expressions.

    Since register pressure is a critical issue in many CUDA codes, use of restricted pointers can have negative performance impact on CUDA code, due to reduced occupancy.
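
    To make the occupancy remark concrete, here is a rough host-side sketch of how the register count per thread caps the number of resident threads on one SM. The function name and the figures used in the usage note (65536 registers and at most 2048 resident threads per SM) are assumptions for illustration; the real limits depend on the compute capability, and the calculation ignores register allocation granularity, shared memory and block-size limits.

    ```cpp
    #include <algorithm>

    // Hypothetical helper: upper bound on resident threads per SM imposed by
    // register usage alone. All hardware limits are passed in by the caller.
    int max_threads_by_registers(int regs_per_sm, int regs_per_thread,
                                 int max_threads_per_sm) {
        const int warp_size = 32;
        int threads = regs_per_sm / regs_per_thread;   // threads the register file can hold
        threads = (threads / warp_size) * warp_size;   // only whole warps are scheduled
        return std::min(threads, max_threads_per_sm);  // hardware cap on resident threads
    }
    ```

    Under those assumed limits, a kernel using 32 registers per thread could still reach 2048 resident threads, while one using 64 registers per thread would be limited to 1024, i.e. half the occupancy.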

    const __restrict__ may be beneficial for at least 2 reasons:

    1. On architectures that support it (compute capability 3.5 and later), it may enable the compiler to route those loads through the read-only data cache, which may be a performance-enhancing feature.

    2. As indicated in the programming guide section quoted above, it may enable other optimizations to be made by the compiler (e.g. reducing instructions and memory accesses), which may also improve performance if the corresponding register pressure does not become an issue.

    That reducing instructions and memory accesses can lead to increased register pressure may be non-intuitive. Let's consider the example given in that programming guide section:

    void foo(const float* a, const float* b, float* c) {
      c[0] = a[0] * b[0];
      c[1] = a[0] * b[0];
      c[2] = a[0] * b[0] * a[1];
      c[3] = a[0] * a[1];
      c[4] = a[0] * b[0];
      c[5] = b[0];
      ...
    }
    

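    If we allow for pointer aliasing in the above example, then any store through c might change the value that a[0], a[1] or b[0] refers to, so the compiler has to reload those values after each store. It can't make many optimizations and is essentially reduced to executing the code exactly as written. The first line of code: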

      c[0] = a[0] * b[0]; 
    

    will require 3 registers. The next line of code:

      c[1] = a[0] * b[0]; 
    

    will also require 3 registers, and because everything is being generated as-written, they can be the same 3 registers, reused. Similar register reuse can occur for the remainder of the example, resulting in low overall register usage/pressure.
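
    Why the compiler has to stay this close to the source when aliasing is possible can be seen in a small host-side sketch. It assumes only the six statements shown in the example (the elided part is left out), and `run_with_c_aliasing_a` is a hypothetical helper that deliberately passes the same buffer as both `a` and `c`, which is legal as long as no `__restrict__` promise has been made.

    ```cpp
    #include <array>

    // The guide's example without __restrict__ (first six statements only).
    void foo_as_written(const float* a, const float* b, float* c) {
        c[0] = a[0] * b[0];
        c[1] = a[0] * b[0];
        c[2] = a[0] * b[0] * a[1];
        c[3] = a[0] * a[1];
        c[4] = a[0] * b[0];
        c[5] = b[0];
    }

    // Hypothetical helper: c points at the same storage as a, so every store
    // through c may change what a[0] and a[1] read on the next statement.
    std::array<float, 6> run_with_c_aliasing_a() {
        std::array<float, 6> buf = {2.0f, 3.0f, 0.0f, 0.0f, 0.0f, 0.0f};
        const float b[1] = {5.0f};
        foo_as_written(buf.data(), b, buf.data());
        return buf;
    }
    ```

    With this aliasing, c[0] and c[1] end up different (10 and 50) even though both statements read `a[0] * b[0]`, because the first store overwrote a[0]. A compiler that cannot rule this out has to reload a[0] and b[0] for each statement, which is exactly the "as written" code with its short-lived registers.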

    But if we allow the compiler to re-order things, which is what declaring the pointers __restrict__ does, then we must have registers assigned for each value loaded up front, and reserved until that value is retired. This re-ordering can increase register usage/pressure, but may ultimately lead to faster code (or it may lead to slower code, if the register pressure becomes a performance limiter).
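
    The re-ordered form can be sketched by hand. This is only an illustration of what the compiler is permitted to do once the pointers are `__restrict__`; the code actually generated depends on the compiler version and target architecture. It again covers only the six statements shown in the example, `run_hoisted` is a hypothetical helper, and the `__restrict__` spelling is assumed to be accepted by the compiler (it is by nvcc, GCC and Clang).

    ```cpp
    #include <array>

    // Hand-written sketch: loads and the common sub-expression are hoisted
    // into temporaries, so several values stay live at the same time.
    void foo_hoisted(const float* __restrict__ a,
                     const float* __restrict__ b,
                     float* __restrict__ c) {
        float t0 = a[0];     // loaded once, live until the store to c[3]
        float t1 = b[0];     // loaded once, live until the store to c[5]
        float t2 = t0 * t1;  // common sub-expression, live until the store to c[2]
        float t3 = a[1];     // loaded once, live until the store to c[3]
        c[0] = t2;
        c[1] = t2;
        c[4] = t2;
        c[2] = t2 * t3;
        c[3] = t0 * t3;
        c[5] = t1;
    }

    // Hypothetical helper: calls foo_hoisted with non-overlapping arrays,
    // which is the only kind of call the __restrict__ promise allows.
    std::array<float, 6> run_hoisted() {
        const float a[2] = {2.0f, 3.0f};
        const float b[1] = {5.0f};
        std::array<float, 6> c{};
        foo_hoisted(a, b, c.data());
        return c;
    }
    ```

    This version performs three loads and three multiplications instead of many more of each, but t0 through t3 are all live at once in the middle of the function, whereas the as-written code never needs more than a few values at a time. That is the trade between fewer instructions and higher register pressure.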