unity-game-engine shader hlsl compute-shader

Difference Between Calling numthreads and Dispatch in a Unity Compute Shader

Hypothetically, say I wanted to use a compute shader to run Kernel_X using thread dimensions of (8, 1, 1).

I could set it up as:

In Script:

Shader.Dispatch(Kernel_X, 8, 1, 1);

In Shader:

[numthreads(1,1,1)]
void Kernel_X(uint id : SV_DispatchThreadID) { ... }

or I could set it up like this:

In Script:

Shader.Dispatch(Kernel_X, 1, 1, 1);

In Shader:

[numthreads(8,1,1)]
void Kernel_X(uint id : SV_DispatchThreadID) { ... }

I understand that either way the total dimensions come out to (8, 1, 1); however, I was wondering how the two setups actually differ. My guess is that Dispatch(Kernel_X, 8, 1, 1) runs a 1x1x1 kernel 8 times, while numthreads(8,1,1) runs an 8x1x1 kernel once.

Solution

To understand the difference, a bit of hardware knowledge is required:

Internally, a GPU works on so-called wavefronts (called warps in NVIDIA terminology), which are processed SIMD-style: a wavefront is a group of threads where each thread can have its own data, but all of them have to execute the exact same instruction at the exact same time, always. The number of threads per wavefront is hardware dependent, but is usually either 32 (NVIDIA) or 64 (AMD).

Now, with [numthreads(8,1,1)] you request a thread group size of 8 x 1 x 1 = 8 threads, which the hardware is free to distribute among its wavefronts. So, with 32 threads per wavefront, the hardware would schedule one wavefront per thread group, with 8 active threads in that wavefront (the other 24 threads are "inactive", meaning they go through the same instructions, but any memory writes they would make are discarded). Then, with Dispatch(1, 1, 1), you dispatch one such thread group, meaning there will be one wavefront running on the hardware.
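To make the split between group size and group count concrete, here is a minimal sketch of how the thread IDs relate. The Result buffer is a hypothetical name added only for illustration; the three semantics are standard HLSL system values.

```hlsl
#pragma kernel Kernel_X

// Hypothetical output buffer, assumed to be bound from the script
// and to hold at least (groups dispatched * 8) elements.
RWStructuredBuffer<uint3> Result;

[numthreads(8,1,1)]
void Kernel_X(uint3 groupId       : SV_GroupID,          // which group, set by Dispatch
              uint3 groupThreadId : SV_GroupThreadID,    // which thread inside the group, 0..7 here
              uint3 dispatchId    : SV_DispatchThreadID) // global thread index
{
    // For every thread: dispatchId.x == groupId.x * 8 + groupThreadId.x
    Result[dispatchId.x] = uint3(groupId.x, groupThreadId.x, dispatchId.x);
}
```

With Dispatch(1, 1, 1) and [numthreads(8,1,1)], groupId.x is always 0 and groupThreadId.x runs from 0 to 7. With Dispatch(8, 1, 1) and [numthreads(1,1,1)], groupId.x runs from 0 to 7 and groupThreadId.x is always 0. In both cases SV_DispatchThreadID covers 0 to 7, which is why the results are identical.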

If you used [numthreads(1,1,1)] instead, only one thread in a wavefront could be active. So, by calling Dispatch(8, 1, 1) on that kernel, the hardware would have to run 8 thread groups (= 8 wavefronts), each one with just 1 of its 32 threads active. You would get the same result, but you would waste a lot more computational power.

So, in general, for optimal performance you want thread group sizes that are multiples of 32 (or 64), while calling Dispatch with numbers as low as reasonably possible.
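As a rough sketch of that advice, the kernel below uses 64 threads per group, which is a multiple of both 32 and 64. The names Data and Count are hypothetical, and the sketch assumes the script rounds the group count up, for example by calling Dispatch with (count + 63) / 64 groups in x.

```hlsl
#pragma kernel Kernel_X

// Hypothetical resources, assumed to be set from the script before dispatching.
RWStructuredBuffer<float> Data;
uint Count; // number of valid elements in Data

// 64 threads per group fills whole wavefronts on hardware with 32 or 64 lanes.
[numthreads(64,1,1)]
void Kernel_X(uint id : SV_DispatchThreadID)
{
    // The group count is rounded up on the script side, so the last
    // group may contain threads past the end of the data; skip them.
    if (id >= Count)
        return;

    Data[id] = Data[id] * 2.0; // placeholder work
}
```

The bounds check is needed because the element count is rarely an exact multiple of the group size, so a few threads in the last group have nothing to do.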