Tags: concurrency, llvm, gpgpu, amd-gpu, hip

What are the requirements for using `shfl` operations on AMD GPU using HIP C++?


There is AMD HIP C++, which is very similar to CUDA C++. AMD also created Hipify to convert CUDA C++ into HIP C++ (portable C++ code), which can be executed on both NVIDIA and AMD GPUs: https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP
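
To illustrate, here is a minimal sketch of my own (not from the HIP samples; assuming a current HIP toolchain with hipLaunchKernelGGL): the same __shfl call compiles for both vendors, and in this example it broadcasts lane 0's register to every lane of the warp/wavefront.

    #include <hip/hip_runtime.h>
    #include <cstdio>

    __global__ void broadcast(int* out) {
        int lane = threadIdx.x;
        int value = lane * 10;              // each lane holds its own value
        // Every lane reads lane 0's 'value' register; no shared memory
        // involved at the source level.
        int fromLaneZero = __shfl(value, 0);
        out[lane] = fromLaneZero;
    }

    int main() {
        int* d_out = nullptr;
        hipMalloc((void**)&d_out, 64 * sizeof(int));
        hipLaunchKernelGGL(broadcast, dim3(1), dim3(64), 0, 0, d_out);
        int h_out[64];
        hipMemcpy(h_out, d_out, sizeof(h_out), hipMemcpyDeviceToHost);
        printf("lane 5 sees %d (expected 0)\n", h_out[5]);
        hipFree(d_out);
        return 0;
    }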

Requirement for NVIDIA:

Please make sure you have a device with compute capability 3.0 or higher in order to use warp shfl operations, and add the -gencode arch=compute_30,code=sm_30 nvcc flag in the Makefile while using this application.

In addition, HIP defines portable mechanisms to query architectural features, and supports a larger 64-bit wavesize which expands the return type for cross-lane functions like ballot and shuffle from 32-bit ints to 64-bit ints.
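
For example (a minimal sketch of my own, assuming the 64-bit return type described above), a lane-counting kernel stays portable if the ballot mask is kept 64 bits wide:

    #include <hip/hip_runtime.h>

    __global__ void count_positive(const int* in, int* out) {
        int lane = threadIdx.x;
        // On HIP, __ballot returns a 64-bit mask so it can cover AMD's
        // 64-lane wavefront; on NVIDIA only the low 32 bits are populated.
        unsigned long long mask = __ballot(in[lane] > 0);
        if (lane == 0)
            *out = __popcll(mask);          // count set bits in the 64-bit mask
    }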

But which AMD GPUs support the shfl functions? Or does every AMD GPU support shfl, given that on AMD GPUs it is implemented through local memory rather than a hardware register-to-register instruction?

NVIDIA GPUs require compute capability (CUDA CC) 3.0 or higher, but what are the requirements for using shfl operations on an AMD GPU with HIP C++?


Solution

    1. Yes, there are new instructions in GCN3 GPUs, such as ds_bpermute and ds_permute, which can provide functionality like __shfl() and even more.

    2. The ds_bpermute and ds_permute instructions use only the routing hardware of local memory (LDS, 8.6 TB/s) but don't actually use local memory itself, which accelerates data exchange between threads: 8.6 TB/s < speed < 51.6 TB/s: http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

    They use LDS hardware to route data between the 64 lanes of a wavefront, but they don’t actually write to an LDS location.

    3. There are also Data-Parallel Primitives (DPP), which are especially powerful when you can use them, since an instruction can read the registers of neighboring work-items directly. I.e., DPP can access a neighboring thread (work-item) at full speed, ~51.6 TB/s:

    http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/

    now, most of the vector instructions can do cross-lane reading at full throughput.

    For example, the wave_shr instruction (wavefront shift right) for a scan algorithm (a HIP-level sketch follows the figure below):

    [figure: the wave_shr cross-lane operation shifting data across wavefront lanes]
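
    A sketch of my own (the standard shuffle-based inclusive scan pattern, not code from the article), using __shfl_up as the HIP-level counterpart of that shift-style data movement:

        #include <hip/hip_runtime.h>

        __global__ void wave_inclusive_scan(int* data) {
            int lane = threadIdx.x;
            int x = data[lane];
            // Each step pulls a value from 'offset' lanes below; on GCN3
            // the compiler can lower small shifts like this to cross-lane ops.
            for (int offset = 1; offset < warpSize; offset *= 2) {
                int y = __shfl_up(x, offset);
                if (lane >= offset)
                    x += y;
            }
            data[lane] = x;   // data[i] now holds the sum of data[0..i]
        }

    Launched with a single block of warpSize threads, each lane ends up with the inclusive prefix sum of the lanes at or below it.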

    More about GCN3: https://github.com/olvaffe/gpu-docs/raw/master/amd-open-gpu-docs/AMD_GCN3_Instruction_Set_Architecture.pdf

    New Instructions

    • “SDWA” – Sub Dword Addressing allows access to bytes and words of VGPRs in VALU instructions.
    • “DPP” – Data Parallel Processing allows VALU instructions to access data from neighboring lanes.
    • DS_PERMUTE_RTN_B32, DS_BPERMUTE_RTN_B32.

    ...

    DS_PERMUTE_B32 Forward permute. Does not write any LDS memory.
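
    These instructions are reachable from HIP on AMD targets through compiler builtins. A hedged sketch of my own (the __builtin_amdgcn_ds_bpermute builtin and the __AMDGCN__ guard are Clang specifics, not from the ISA document; note the builtin takes a byte address, i.e. lane*4):

        // Backward permute: each lane reads 'value' from lane 'src_lane'.
        __device__ int read_lane(int value, int src_lane) {
        #if defined(__AMDGCN__)
            // ds_bpermute addresses lanes in bytes, hence the *4.
            return __builtin_amdgcn_ds_bpermute(src_lane * 4, value);
        #else
            return __shfl(value, src_lane);   // fallback on NVIDIA
        #endif
        }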