Search code examples
multithreadingconcurrencyopenclgpgpuamd-gpu

Does the official OpenCL 2.2 standard support the WaveFront?


As known, AMD-OpenCL supports WaveFront (August 2015): http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf

The AMD Radeon HD 7770 GPU, for example, supports more than 25,000 in-flight work-items and can switch to a new wavefront (containing up to 64 work-items) in a single cycle.


But why in the OpenCL standards 1.0/2.0/2.2 there is no mention about the WaveFront?

None of the PDF has not a word WaveFront: https://www.khronos.org/registry/OpenCL/specs/

Also I found that:

OpenCL is a open standard. It still does not support this swizzling concept. It does not even support wavefront/warp yet.

That's why the concept is not on the OpenCL specification itself.

Standard OpenCL doesn't have the notion of a "wavefront"

enter image description here

Indeed the official OpenCL 2.2 standard still does not support the WaveFront?


CONCLUSION:

There is no WaveFront in OpenCL standard, but in OpenCL-2.0 there is Sub-groups with SIMD execution model akin to WaveFronts.

6.4.2 Workgroup/subgroup-level functions

OpenCL 2.0 introduces a Khronos sub-group extension. Sub-groups are a logical abstraction of the hardware SIMD execution model akin to wavefronts, warps, or vectors and permit programming closer to the hardware in a vendor-independent manner. This extension includes a set of cross-sub-group built-in functions that match the set of the cross-work-group built-in functions specified above.


Solution

  • They must have gone to a more dynamical approach called sub-group: https://www.khronos.org/registry/OpenCL/specs/opencl-2.2.pdf

    Sub-group: Sub-groups are an implementation-dependent grouping of work-items within a
    work-group. The size and number of sub-groups is implementation-defined.
    

    and

    Work-groups are further divided into sub-groups,
    which provide an additional level of control over execution.
    

    and

    The mapping of work-items to
    sub-groups is implementation-defined and may be queried at runtime. 
    

    so even if its not called wavefront, its now queryable in run-time and

    In the absence of synchronization functions (e.g. a barrier), work-items within a sub-group may be serialized. In the presence of sub -group functions, work-items within a sub -group may be serialized before any given sub -group function, between dynamically encountered pairs of sub - group functions and between a work-group function and the end of the kernel.

    even lockstep manner may be lost at times.

    On top of these,

     sub_group_all() and
    sub_group_broadcast() and are described in OpenCL C++ kernel language and IL specifications.
    The use of these sub-group functions implies sequenced-before relationships between statements
    within the execution of a single work-item in order to satisfy data dependencies.
    

    saying that some kind of intra-sub-group communication exists. Because now opencl has child-kernel definition:

    Device-side enqueue: A mechanism whereby a kernel-instance is enqueued by a kernel-instance
    running on a device without direct involvement by the host program. This produces nested
    parallelism; i.e. additional levels of concurrency are nested inside a running kernel-instance.
    The kernel-instance executing on a device (the parent kernel) enqueues a kernel-instance (the
    child kernel) to a device-side command queue. Child and parent kernels execute asynchronously
    though a parent kernel does not complete until all of its child-kernels have completed. 
    

    Ultimately, with something like

    kernel void launcher()
    {
        ndrange_t ndrange = ndrange_1D(1);
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange,
        ^{
        size_t id = get_global_id(0);
        }
        );
    }
    

    you should be able to spawn your own (upgraded?)wavefronts with any size you need and they work concurrently with parent kernel(and can communicate intra-sub-group threads) but they are not called wavefronts because they are not hardcoded by hardware imho.


    2.0 api specs saying:

    Extreme care should be exercised when writing code that uses
    subgroups if the goal is to write portable OpenCL applications.
    

    which reminds amd's 16-wide simds and nvidia's 32-wide simds versus some imaginary fpga's 95-wide compute cores. Pseudo-wavefront maybe?