opengl gpgpu compute-shader

What's the point of a compute shader having a local size in addition to work groups?


What's the difference between

glDispatchCompute(1, 1, 1);
layout(local_size_x = 100, local_size_y = 100, local_size_z = 1) in;

and

glDispatchCompute(100, 100, 1);
layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

Both execute the same total number of invocations. The only difference I can see is conceptual: either you have one large work group containing 100x100 invocations, or you have 100x100 work groups with a single invocation each. Is there any real effect on performance?


Solution

  • The difference between workgroup size and number of work groups is not purely conceptual. For example:

    • Invocations in the same workgroup can use the same shared memory. Invocations in different workgroups cannot share data directly.
    • Barriers only affect invocations in the same workgroup. There is no way to synchronize invocations in different workgroups.
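
    The two points above come down to which invocations land in the same workgroup. The following is a minimal host-side sketch in plain C, not shader code: `uvec3`, `global_invocation_id` and `same_workgroup` are hypothetical names introduced here for illustration. It mirrors the GLSL relation `gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID` and checks whether two invocations belong to the same group.

```c
#include <assert.h>

/* Stand-in for GLSL's uvec3; not part of any OpenGL header. */
typedef struct { unsigned x, y, z; } uvec3;

/* Mirrors the GLSL built-in relation
 *   gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID */
static uvec3 global_invocation_id(uvec3 group_id, uvec3 group_size, uvec3 local_id)
{
    /* A local ID always lies inside its workgroup. */
    assert(local_id.x < group_size.x &&
           local_id.y < group_size.y &&
           local_id.z < group_size.z);
    uvec3 g = { group_id.x * group_size.x + local_id.x,
                group_id.y * group_size.y + local_id.y,
                group_id.z * group_size.z + local_id.z };
    return g;
}

/* Returns 1 when two global IDs fall into the same workgroup, i.e. when the
 * two invocations could use the same `shared` variables and meet at the
 * same barrier(); returns 0 otherwise. */
static int same_workgroup(uvec3 a, uvec3 b, uvec3 group_size)
{
    return a.x / group_size.x == b.x / group_size.x
        && a.y / group_size.y == b.y / group_size.y
        && a.z / group_size.z == b.z / group_size.z;
}
```

    With a group size of 100x100x1, the global IDs (0, 0, 0) and (99, 99, 0) fall into the same workgroup. With a group size of 1x1x1, every invocation is alone in its group, so shared memory and barriers have nothing to coordinate.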

    The performance might differ depending on how your driver maps compute shader invocations to the SIMD units of the GPU (32/64/... units depending on the GPU). It is very likely that invocations inside the same workgroup are actually executed in parallel (up to the number of units). It is quite possible that invocations from different workgroups are executed sequentially, although I've also seen GPUs execute multiple workgroups at the same time. The OpenGL standard gives no guarantee on how invocations are mapped to execution units or warps, so the mapping used on your machine will depend heavily on the hardware and on the driver.
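
    To get a feeling for why the mapping matters, here is a rough back-of-the-envelope model in C. It rests on two assumptions that OpenGL does not guarantee: the hardware runs invocations in SIMD batches ("waves") of a fixed width, and a wave never mixes invocations from different workgroups. The helper names `waves_needed` and `lane_utilization` are made up for this sketch.

```c
#include <assert.h>

/* Number of SIMD waves needed for a dispatch, ASSUMING each workgroup is
 * split into waves of `wave_size` lanes and waves are never shared between
 * workgroups. Real drivers may behave differently. */
static unsigned long waves_needed(unsigned long num_groups,
                                  unsigned long group_invocations,
                                  unsigned long wave_size)
{
    assert(wave_size > 0);
    /* Round up: a partially filled wave still occupies a whole wave. */
    unsigned long waves_per_group = (group_invocations + wave_size - 1) / wave_size;
    return num_groups * waves_per_group;
}

/* Fraction of SIMD lanes doing useful work under the same assumptions. */
static double lane_utilization(unsigned long num_groups,
                               unsigned long group_invocations,
                               unsigned long wave_size)
{
    unsigned long waves = waves_needed(num_groups, group_invocations, wave_size);
    assert(waves > 0);
    return (double)(num_groups * group_invocations) / (double)(waves * wave_size);
}
```

    Under this model with a wave size of 32, `waves_needed(10000, 1, 32)` is 10000 waves with only 1 of 32 lanes busy in each, while `waves_needed(1, 10000, 32)` is 313 waves that are almost full. Note also that OpenGL only guarantees `GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS` to be at least 1024, so a 100x100 local size may be rejected; something like an 8x8 local size with 13x13 workgroups (338 waves in this model) is a more portable way to cover a 100x100 domain, provided the shader skips out-of-range invocations.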

    For the best performance with a specific shader, you will need to profile different combinations of workgroup size and number of workgroups, but these articles might give you some more hints on how to determine the sizes:

    OpenGL compute shader mapping to nVidia warps
    Which distribution of work in a compute shader leads to more performance? (Reddit)
    Do compute shaders only parallelize up to local workgroup size? (Reddit)