Search code examples
cudarestrict-qualifier

How to combine __restrict__ with an array pointed to by a __constant__ pointer?


This will be a bit of a funky question I assume and if I need to elaborate, please say so.

The situation is as follows: I have about 2 gigs of GPU memory containing my random numbers and I need to use those in many different functions. To prevent passing around the pointers to this memory, from device function to device function (and this many times over), I put the pointers in the gpu constant memory, which is also saving me registers (for me very important). Now I know that functions can be sped up in some cases if they are explained that memory chunks pointed to by it's arguments are non-overlapping, by using the keyword __restrict__.

The question: how can I make sure the compiler knows that the memory chunks in global memory pointed to by the pointers in constant memory are non-overlapping (and maybe also nice to know: not ever changing after the generate randoms kernel call)?


Solution

  • I am not aware of a way to provide the compiler with heuristics on otherwise anonymous pointers.

    If you can manage it, the simplest way to try and help the compiler do its job is to pass the pointers as __restrict__ decorated kernel arguments and then force device functions inline. That will bypass the ABI and may allow the compiler to exploit the known non-aliasing condition to optimise memory access patterns. It should also help with the register footprint of your functions a bit. I'm not sure that __restrict__ will have much effect on __device__ functions or __constant__ declarations, but you have noted that the compiler accepts it, so I guess it can't hurt to at least try.

    I would look forward to comments from one of NVIDIA's toolchain or optimisation gurus on what might go on under the hood and what other tricks might be useful in this case.