Search code examples
c++cudaatomicinstruction-setptx

How can I utilize the 'red' and 'atom' PTX instructions in CUDA C++ code?


The CUDA PTX Guide describes the instructions 'atom' and 'red', which perform atomic and non-atomic reductions. This is news to me (at least with respect to non-atomic reductions)... I remember learning how to do reductions with SHFL a while back. Are these instructions reflected or wrapped somehow in CUDA runtime APIs? Or some other way accessible with C++ code without actually writing PTX code?


Solution

  • Are these instructions reflected or wrapped somehow in CUDA runtime APIs? Or some other way accessible with C++ code without actually writing PTX code?

    Most of these instructions are reflected in atomic operations (built-in intrinsics) described in the programming guide. If you compile any of those atomic intrinsics, you will find atom or red instructions emitted by the compiler at the PTX or SASS level in your generated code.

    The red instruction type will generally be used when you don't explicitly use the return value from from one of the atomic intrinsics. If you use the return value explicitly, then the compiler usually emits the atom instruction.

    Thus, it should be clear that this instruction by itself does not perform a complete classical parallel reduction, but certainly could be used to implement one if you wanted to depend on atomic hardware (and associated limitations) for your reduction operations. This is generally not the fastest possible implementation for parallel reductions.

    If you want direct access to these instructions, the usual advice would be to use inline PTX where desired.

    As requested, to elaborate using atomicAdd() as an example:

    If I perform the following:

    atomicAdd(&x, data);
    

    perhaps because I am using it for a typical atomic-based reduction into the device variable x, then the compiler would emit a red (PTX) or RED (SASS) instruction taking the necessary arguments (the pointer to x and the variable data, i.e. 2 logical registers).

    If I perform the following:

    int offset = atomicAdd(&buffer_ptr, buffer_size);
    

    perhaps because I am using it not for a typical reduction but instead to reserve a space (buffer_size) in a buffer shared amongst various threads in the grid, which has an offset index (buffer_ptr) to the next available space in the shared buffer, then the compiler would emit a atom (PTX) or ATOM (SASS) instruction, including 3 arguments (offset, &buffer_ptr, and buffer_size, in registers).

    The red form can be issued by the thread/warp which may then continue and not normally stall due to this instruction issue which will normally have no dependencies for subsequent instructions. The atom form OTOH will imply modification of one of its 3 arguments (one of 3 logical registers). Therefore subsequent use of the data in that register (i.e. the return value of the intrinsic, i.e. offset in this case) can result in a thread/warp stall, until the return value is actually returned by the atomic hardware.