GPU What are the proper thread dimensions for a compute shader with a very large work load?

I'm working on a heightmap erosion compute shader in Unity, where each point on the map is eroded separately. This works well for small maps, but my project requires 4096x4096 maps, which means 4096^2 = 16777216 points to simulate. With the default thread dimensions of [64,1,1], this creates 262144 thread groups, far more than the allowed limit of 65535.

My question is:
Can I simply raise the thread dimensions, and what do I have to consider in terms of performance when I do? Is it maybe possible to simply run the shader multiple times, with different ranges of heightmap coordinates?

This is my first time working with shaders. The tutorials I've seen online quickly go too deep into GPU hardware specifications, so I didn't pick up much from them.

Solution

A work group can hold at most 1024 threads (Direct3D 11, cs_5_0), so [numthreads(64, 64, 1)] is not valid: 64x64 = 4096 threads. The 65535 limit applies to each dispatch dimension separately, so spread the work over both x and y. With 8x8 threads per work group, you can Dispatch 512x512 work groups to do what you need: 8x8 threads are invoked for each work group you dispatch, so you get 512 work groups x 8 threads = 4096 threads along each axis, one per heightmap point.

computeShader.Dispatch(computeShader.FindKernel("kernel"), 512, 512, 1);

[numthreads(8, 8, 1)]
void kernel(uint3 id : SV_DispatchThreadID)
{
    // ...
    // 0 <= id.x < 4096
    // 0 <= id.y < 4096
}
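
When the map size is not an exact multiple of the group size, the group count has to be rounded up. Here is a minimal sketch in plain C#, assuming the Direct3D 11 limits of 1024 threads per group and 65535 groups per dimension (other platforms may be stricter). The DispatchSize class and its method names are made up for illustration and are not part of Unity.

```csharp
using System;

public static class DispatchSize
{
    // Assumed limits (Direct3D 11, cs_5_0); other graphics APIs may differ.
    public const int MaxThreadsPerGroup = 1024;
    public const int MaxGroupsPerDimension = 65535;

    // Number of work groups needed to cover `size` items with `threadsPerGroup`
    // threads each, rounded up so a partial last group is still dispatched.
    public static int GroupCount(int size, int threadsPerGroup)
    {
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
        if (threadsPerGroup <= 0) throw new ArgumentOutOfRangeException(nameof(threadsPerGroup));
        return (size + threadsPerGroup - 1) / threadsPerGroup;
    }

    // True if a 2D dispatch of this shape stays within the limits above.
    public static bool Fits(int width, int height, int threadsX, int threadsY)
    {
        return threadsX * threadsY <= MaxThreadsPerGroup
            && GroupCount(width, threadsX) <= MaxGroupsPerDimension
            && GroupCount(height, threadsY) <= MaxGroupsPerDimension;
    }
}
```

For a 4096x4096 map, DispatchSize.GroupCount(4096, 8) gives 512 groups per axis. If the count was rounded up, the kernel should return early for ids outside the map, for example by comparing id.x and id.y against a size you pass in as a uniform.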

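As for the performance implications, the general answer is "try it out!": run your kernel with different thread and work group sizes. The results vary with your computations and your hardware. A group size whose total thread count is a multiple of 64 (such as 8x8) is a common starting point, since GPUs typically schedule threads in batches of 32 or 64.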

There is also DispatchIndirect. It does not lift the 65535 limit, but it is useful when the number of work groups is computed on the GPU. It is the same as Dispatch, except that the arguments are passed through a ComputeBuffer.

ComputeBuffer argsBuffer = new ComputeBuffer(3, sizeof(uint), ComputeBufferType.IndirectArguments);
uint[] args = { 512, 512, 1 }; // work groups in x, y, z, assuming the kernel declares [numthreads(8, 8, 1)]
argsBuffer.SetData(args);
computeShader.DispatchIndirect(computeShader.FindKernel("kernel"), argsBuffer);
argsBuffer.Release(); // free the buffer once it is no longer needed
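
The question also asks about running the shader several times over different coordinate ranges. That works too: split the map into tiles and issue one dispatch per tile, passing the tile's offset to the kernel. Here is a rough sketch of how the tiles could be enumerated in plain C#. The TiledDispatch class, the Tile struct and the "offset" uniform are hypothetical names, not Unity API.

```csharp
using System;
using System.Collections.Generic;

public static class TiledDispatch
{
    // One dispatch covering the region that starts at (OffsetX, OffsetY).
    public struct Tile
    {
        public int OffsetX, OffsetY; // first heightmap coordinate of the tile
        public int GroupsX, GroupsY; // work groups to dispatch for the tile
    }

    // Splits a width x height map into tiles of at most tileSize x tileSize points.
    public static IEnumerable<Tile> Split(int width, int height, int tileSize,
                                          int threadsX, int threadsY)
    {
        for (int y = 0; y < height; y += tileSize)
        {
            for (int x = 0; x < width; x += tileSize)
            {
                int w = Math.Min(tileSize, width - x);  // edge tiles may be smaller
                int h = Math.Min(tileSize, height - y);
                yield return new Tile
                {
                    OffsetX = x,
                    OffsetY = y,
                    GroupsX = (w + threadsX - 1) / threadsX, // round up
                    GroupsY = (h + threadsY - 1) / threadsY
                };
            }
        }
    }
}

// In Unity, each tile could then be dispatched roughly like this, assuming the
// kernel declares a hypothetical int2 uniform "offset" and adds it to id.xy:
//
//   computeShader.SetInts("offset", tile.OffsetX, tile.OffsetY);
//   computeShader.Dispatch(kernelIndex, tile.GroupsX, tile.GroupsY, 1);
```

For a 4096x4096 map, a single 512x512 dispatch with 8x8 threads already fits within the limits, so tiling is only needed for maps large enough to exceed 65535 groups along one axis.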

PS: working on a GPU requires understanding its architecture because (1) you work at a low level, close to the hardware, and many of the features you use are implemented in hardware (e.g. textures); and (2) you want to get the best performance out of your programs (e.g. by making good use of blocks, warps and caches) ;)