Register usage on NVIDIA with hipSYCL / LLVM

I am looking at the performance of a SYCL port of some HPC code, which I am running on a GV100 card via hipSYCL.

Running the code through a profiler tells me that very high register usage is the likely limiting factor for performance.

Is there any way to influence the register usage of the GPU code that hipSYCL / clang generates, something akin to nvcc's -maxrregcount option?

Solution

hipSYCL invokes the clang CUDA toolchain. As far as I know, clang CUDA and the LLVM NVPTX backend do not have a direct analogue to -maxrregcount, but the LLVM NVPTX backend option --nvptx-sched4reg may help (LLVM backend options like this one are normally forwarded through clang with -mllvm). It tells the backend's instruction scheduler to schedule for minimum register pressure instead of simply following source order.

If you use accessors, you can also try switching to SYCL 2020 USM pointers. In hipSYCL[1], accessors will always use more registers than raw pointers because, in addition to the data pointer, they need to store the valid access range and offset.
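
To illustrate the extra state, here is a minimal sketch in plain C++ (no SYCL headers needed). `mock_accessor`, `sum_accessor`, `sum_pointer` and `extra_state_bytes` are hypothetical names made up for this example; the struct is not hipSYCL's actual accessor layout, it only models the pointer-plus-range-plus-offset idea described above. How many registers this costs in a real kernel depends on the compiler and on what it can optimize away.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a library-only accessor. This is NOT
// hipSYCL's real layout; it only models the state named above.
template <class T, int Dim>
struct mock_accessor {
  T* data;                  // pointer to the underlying allocation
  std::size_t range[Dim];   // valid access range per dimension
  std::size_t offset[Dim];  // access offset per dimension

  // 1D-style element access: check against the range, apply the offset.
  T& operator[](std::size_t i) const {
    assert(i < range[0]);
    return data[offset[0] + i];
  }
};

// Kernel-like body that receives the accessor-like object by value,
// so pointer, range and offset all travel with it.
inline float sum_accessor(mock_accessor<float, 1> acc) {
  float s = 0.0f;
  for (std::size_t i = 0; i < acc.range[0]; ++i) s += acc[i];
  return s;
}

// Same body written against a raw pointer, as a USM allocation
// would provide: only the pointer itself is carried.
inline float sum_pointer(const float* ptr, std::size_t n) {
  float s = 0.0f;
  for (std::size_t i = 0; i < n; ++i) s += ptr[i];
  return s;
}

// Bytes of per-argument state beyond the bare pointer.
template <int Dim>
constexpr std::size_t extra_state_bytes() {
  return sizeof(mock_accessor<float, Dim>) - sizeof(float*);
}

// The mock carries at least 2 * Dim extra size_t values per argument.
static_assert(extra_state_bytes<1>() >= 2 * sizeof(std::size_t), "1D");
static_assert(extra_state_bytes<3>() >= 6 * sizeof(std::size_t), "3D");
```

In this model every accessor-like kernel argument carries `2 * Dim` extra index values that a raw pointer does not, which is the kind of overhead that switching to USM pointers avoids.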

[1] This also applies to any other SYCL implementation that relies heavily on library-only semantics.