multithreading, graphics, vulkan, hlsl, compute-shader

HLSL numthreads and dispatch (Vulkan): how to dispatch efficiently?


I am writing a particle simulation in Vulkan using HLSL compiled to SPIR-V with DXC on Linux.

I am realising I have a misunderstanding of how numthreads works.

Let me share a small snippet:

[numthreads(100, 1, 1)]
void main(uint3 id : SV_DispatchThreadID)
{
    uint index = id.x;

    simulate_particle(index);
    transfer_velocity_to_grid(min(index, 100));
}

When I do this I seem to get more than 100 threads, so my index goes out of the intended bounds. On the other hand, this is fine:

[numthreads(1, 1, 1)]
void main(uint3 id : SV_DispatchThreadID)
{
    uint index = id.x;

    simulate_particle(index);
    transfer_velocity_to_grid(min(index, 100));
}

i.e. the above spawns exactly 100 threads. On the CPU side, in both versions I am dispatching (100, 1, 1) work groups. So it seems that in the first version I was invoking 100*100 threads rather than just 100.

But of course with the version that works I am underutilizing the GPU.

I could declare numthreads(100, 1, 1) inside the shader and dispatch only (1, 1, 1) work groups. But then, if the amount of work changes at runtime, I can't update it.

If the amount of work is always a multiple of numthreads, then I could just dispatch total_dispatches / numthreads work groups.

But if the number of threads I need is not such a multiple, I don't know how to dispatch my work groups efficiently. I would be launching either more or fewer threads than I actually need, and in both cases I will run into errors.

Is there a way to send exactly a dynamic amount of work to the compute shader, spawn exactly that many threads and efficiently use the GPU to execute them?


Solution

  • The numthreads attribute defines the number of threads per workgroup.

    The Vulkan vkCmdDispatch() call defines the number of workgroups to launch in each dimension. Its parameters can be changed at runtime to adjust the total size of the dispatch; you don't need to change your shader at all.

    Note that the total number of threads launched is always a whole multiple of the workgroup size, so if your problem size is not such a multiple, you need to round the workgroup count up and handle the overspill threads (e.g. by skipping writes for out-of-range indices, or by writing dummy data to an overspill region that you ignore later).

    The optimal size of the workgroup depends on what you are doing. Single item workgroups may be fine, and allow arbitrary sizing of the problem domain, but you may want to match the hardware subgroup size for some use cases (e.g. if you want to use subgroup operations).
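
To make the rounding and the overspill handling concrete, here is a minimal CPU-side sketch in C. It is not taken from the question or from any particular engine: the helper names group_count_for and count_active_invocations are hypothetical, and the workgroup size of 64 used in the note below is only an illustrative choice. The first function rounds the workgroup count up; the second models the launched threads on the CPU and applies the same bounds check the shader would need.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of workgroups needed to cover `total_items`
   when each workgroup runs `local_size` threads (the numthreads X value).
   Rounds up, so the last group may be only partly used. */
uint32_t group_count_for(uint32_t total_items, uint32_t local_size)
{
    assert(local_size > 0);
    /* Written this way to avoid overflow in total_items + local_size - 1. */
    return total_items / local_size
         + (total_items % local_size != 0 ? 1u : 0u);
}

/* CPU-side model of a 1D dispatch: walks every (group, local thread) pair
   the GPU would launch and applies the bounds check the shader needs.
   Returns how many invocations passed the check and did real work. */
uint32_t count_active_invocations(uint32_t total_items, uint32_t local_size)
{
    uint32_t groups = group_count_for(total_items, local_size);
    uint32_t active = 0;
    for (uint32_t g = 0; g < groups; ++g) {
        for (uint32_t t = 0; t < local_size; ++t) {
            /* Corresponds to SV_DispatchThreadID.x in the shader. */
            uint32_t index = g * local_size + t;
            if (index >= total_items) {
                continue; /* overspill thread: the shader would return early */
            }
            ++active;
        }
    }
    return active;
}
```

With 100 particles and a workgroup size of 64, group_count_for returns 2, so 128 threads are launched and the guard leaves exactly 100 doing work. In a real application the result would be passed as the groupCountX argument of vkCmdDispatch(), and the shader would need the particle count at runtime to perform the same check; supplying it through a push constant or uniform buffer is one option, assumed here rather than stated in the answer.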