Tags: concurrency, opencl, simd, gpgpu, amd-gpu

Is there any guarantee that all of the threads in a WaveFront (OpenCL) are always synchronized?


As is known, there are WARPs (in CUDA) and WaveFronts (in OpenCL): http://courses.cs.washington.edu/courses/cse471/13sp/lectures/GPUsStudents.pdf


4.1. SIMT Architecture

...

A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.

The SIMT architecture is akin to SIMD (Single Instruction, Multiple Data) vector organizations in that a single instruction controls multiple processing elements. A key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread.

During runtime, the first wavefront is sent to the compute unit to run, then the second wavefront is sent to the compute unit, and so on. Work-items within one wavefront are executed in parallel and in lock step. But different wavefronts are executed sequentially.


I.e., we know that:

  • threads in a WARP (CUDA) are SIMT threads, which always execute the same instruction at the same time and always stay synchronized - i.e. the threads of a WARP are the same as the lanes of SIMD (on a CPU)

  • threads in a WaveFront (OpenCL) are threads that always execute in parallel, but not necessarily all of them perform the exact same instruction, and not necessarily all of them are synchronized

But is there any guarantee that all of the threads in a WaveFront are always synchronized, like the threads in a WARP or the lanes of SIMD?


Conclusion:

  1. WaveFront threads (work-items) are always synchronized - in lock step: "wavefront executes a number of work-items in lock step relative to each other."
  2. A WaveFront is mapped onto a SIMD block: "all work-items in the wavefront go to both paths of flow control"
  3. I.e. each WaveFront thread (work-item) is mapped to a SIMD lane

page-20: http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf

Chapter 1 OpenCL Architecture and AMD Accelerated Parallel Processing

1.1 Terminology

...

Wavefronts and work-groups are two concepts relating to compute kernels that provide data-parallel granularity. A wavefront executes a number of work-items in lock step relative to each other. Sixteen work-items are executed in parallel across the vector unit, and the whole wavefront is covered over four clock cycles. It is the lowest level that flow control can affect. This means that if two work-items inside of a wavefront go divergent paths of flow control, all work-items in the wavefront go to both paths of flow control.
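(In other words, 16 SIMD lanes × 4 clock cycles = 64 work-items, which matches the wavefront size of 64.)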

This is true for: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf

  • (page-45) Chapter 2 OpenCL Performance and Optimization for GCN Devices
  • (page-81) Chapter 3 OpenCL Performance and Optimization for Evergreen and Northern Islands Devices

Solution

  • First, you can query some values:

    CL_DEVICE_WAVEFRONT_WIDTH_AMD
    CL_DEVICE_SIMD_WIDTH_AMD
    CL_DEVICE_WARP_SIZE_NV
    CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
    

    but only from the host side, as far as I know.
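
    A minimal host-side sketch of those queries (assuming the cl_amd_device_attribute_query / cl_nv_device_attribute_query extensions and an already-created device and kernel; the vendor-specific queries simply fail on other vendors' devices):

        #include <stdio.h>
        #include <CL/cl.h>
        #include <CL/cl_ext.h>   /* vendor tokens such as CL_DEVICE_WAVEFRONT_WIDTH_AMD */

        /* Print the SIMD/wavefront-related sizes for a device and a built kernel. */
        static void print_wave_info(cl_device_id device, cl_kernel kernel)
        {
            cl_uint width = 0;
            size_t preferred = 0;

        #ifdef CL_DEVICE_WAVEFRONT_WIDTH_AMD
            /* AMD: wavefront width (64 on GCN). */
            if (clGetDeviceInfo(device, CL_DEVICE_WAVEFRONT_WIDTH_AMD,
                                sizeof(width), &width, NULL) == CL_SUCCESS)
                printf("CL_DEVICE_WAVEFRONT_WIDTH_AMD = %u\n", width);
        #endif
        #ifdef CL_DEVICE_SIMD_WIDTH_AMD
            /* AMD: width of one SIMD inside the compute unit. */
            if (clGetDeviceInfo(device, CL_DEVICE_SIMD_WIDTH_AMD,
                                sizeof(width), &width, NULL) == CL_SUCCESS)
                printf("CL_DEVICE_SIMD_WIDTH_AMD = %u\n", width);
        #endif
        #ifdef CL_DEVICE_WARP_SIZE_NV
            /* NVIDIA: warp size (32). */
            if (clGetDeviceInfo(device, CL_DEVICE_WARP_SIZE_NV,
                                sizeof(width), &width, NULL) == CL_SUCCESS)
                printf("CL_DEVICE_WARP_SIZE_NV = %u\n", width);
        #endif
            /* Portable hint, usually equal to the warp/wavefront size: */
            if (clGetKernelWorkGroupInfo(kernel, device,
                                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                         sizeof(preferred), &preferred, NULL) == CL_SUCCESS)
                printf("CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE = %zu\n", preferred);
        }

    These values are informational only; as explained below, knowing them does not let the kernel control or detect which SIMD a work-item actually lands on.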

    Let's assume these queries returned 64, and that your question is about the threads' implicit synchronization.

    What if someone chooses a local range of 4?

    Since OpenCL abstracts the hardware clockwork away from the developer, you can't know the actual SIMD or WaveFront size from within a kernel at runtime.

    For example, an AMD NCU has 64 shaders, but it has 16-wide SIMD, 8-wide SIMD, 4-wide SIMD, 2-wide SIMD and even two scalar units inside the same compute unit.

    4 local threads could be spread over the two scalar units and one 2-wide unit, or over any other combination of SIMDs. Kernel code can't know this. Even if it could somehow find out, you can't know which SIMD combination will be used for the next kernel execution (or even the next work-group) at runtime on a random compute unit (64 shaders).

    Or a GCN CU, which has 4x16 SIMDs in it, could allocate 1 thread per SIMD, making all 4 threads totally independent. If they all reside in the same SIMD, you're lucky. There is no way to know that "before" kernel execution. Even once you know, the next kernel could be different, since there is no guarantee the same SIMD allocation will be chosen (background kernels, 3D visualization software, even the OS could be putting bubbles in the pipelines).

    There is no way to command/hint/query that N threads will run in the same SIMD or the same WARP before kernel execution. Then, in the kernel, there is no command to get a thread's wavefront index the way get_global_id(0) gives its global index. Then, after the kernel, you can't rely on the array results, since the next kernel execution may not use the same SIMDs for exactly the same items. Even some items from other wavefronts could be swapped with an item from the current wavefront, purely as an optimization by the driver or the hardware (Nvidia has had a load balancer lately and could be doing this; AMD's NCU may use something similar in the future).
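
    As an illustration of that limitation, the most a kernel can do is compute a presumed wavefront/lane index from get_local_id(0) and a width obtained from the host-side queries above; nothing guarantees that this matches the real hardware mapping. (WAVE_WIDTH here is a hypothetical compile-time define, e.g. passed with "-D WAVE_WIDTH=64" in the clBuildProgram options.)

        // Hypothetical define supplied by the host build options: -D WAVE_WIDTH=64
        #ifndef WAVE_WIDTH
        #define WAVE_WIDTH 64
        #endif

        __kernel void guess_wave_index(__global uint *wave_idx, __global uint *lane_idx)
        {
            size_t gid = get_global_id(0);
            size_t lid = get_local_id(0);

            // Only a guess: this assumes work-items are packed into wavefronts
            // in local-id order, which no OpenCL specification promises.
            wave_idx[gid] = (uint)(lid / WAVE_WIDTH);
            lane_idx[gid] = (uint)(lid % WAVE_WIDTH);
        }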

    Even if you guess the right combination of threads on SIMDs for your hardware and driver, it could be totally different on another computer.


    If it's from a performance point of view, you could try:

    • zero branches in the kernel code
    • zero kernels running in the background
    • making sure the GPU is not being used for monitor output
    • making sure the GPU is not being used by some visualization software

    Just to make sure, with ~99% probability, that there are no bubbles in the pipelines, so that all threads retire an instruction in the same cycle (or at least synchronize at the latest-retiring one).

    Or, add a fence after every instruction to synchronize on global or local memory, which is very slow. Fences give work-item-level synchronization, barriers give work-group-level synchronization. There are no wavefront synchronization commands.
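
    For reference, this is roughly what those two levels look like in OpenCL C; note that there is no wavefront-level call in between (the kernel below is only a sketch):

        __kernel void fence_vs_barrier(__global float *out, __local float *tmp)
        {
            size_t lid = get_local_id(0);

            tmp[lid] = (float)lid * 2.0f;

            // mem_fence only orders this work-item's own memory operations;
            // it does not synchronize it with any other work-item.
            mem_fence(CLK_LOCAL_MEM_FENCE);

            // barrier synchronizes the entire work-group (all of its wavefronts),
            // which is coarser than a per-wavefront synchronization would be.
            barrier(CLK_LOCAL_MEM_FENCE);

            // After the barrier it is safe to read a neighbour's value.
            out[get_global_id(0)] = tmp[(lid + 1) % get_local_size(0)];
        }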

    Then the threads that run within the same SIMD will behave as if synchronized. But you won't know which threads those are, nor on which SIMDs.

    For the 4-thread example, using float16 for all calculations may let the driver use the 16-wide SIMDs of an AMD GCN CU for the computation, but then they are not threads anymore, only variables. Still, this should give better synchronization on the data than separate threads do.
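
    A rough sketch of that idea: instead of 4 separate work-items, one work-item carries the data in a float16, so the 16 "lanes" are operated on by single vector instructions and can never fall out of step with each other (the kernel name and data layout are made up for illustration):

        __kernel void vectorized_madd(__global const float16 *a,
                                      __global const float16 *b,
                                      __global float16 *c)
        {
            size_t i = get_global_id(0);

            // One vector operation per instruction covers all 16 lanes of this
            // work-item's data, so there is no per-lane divergence to worry about.
            c[i] = a[i] * b[i] + (float16)(1.0f);
        }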

    There are more complex situations such as:

    • 4 threads are in the same SIMD, but one thread's calculation produces a NaN value and does an extra normalization on it (taking 1-2 cycles maybe). The 3 others should wait for it to complete, but the lock-step execution works independently of such data-related slowdowns.

    • 4 threads in the same wavefront are in a loop, and one of them gets stuck forever. Do the 3 others wait forever for the 4th one to finish, or does the driver detect this and move it to another free SIMD? Or does the 4th one wait for the other 3 at the same time, because they are not moving either?

    • 4 threads doing atomic operations, one by one.

    • AMD's HD5000-series GPUs have a SIMD width of 4 (or 5), but the wavefront size is 64.