Tags: multithreading, directx, hlsl, directcompute

DirectCompute multithreading performance (threads and thread groups) for multidimensional array processing


I understand that Dispatch(x, y, z) defines how many groups of threads are instantiated, and numthreads(n, m, p) gives the size of each group.

Together, Dispatch and numthreads give the total number of threads. I also understand that the dispatch arguments are used to pass parameters to each thread.

Questions:

1) Is there a performance difference between I groups of J threads and J groups of I threads? Both options give the same total number of threads.

2) Assuming I have to process a two-dimensional matrix whose size is only known at runtime, it is convenient to use Dispatch(DimX, DimY, 1) and numthreads(1, 1, 1) so that I have exactly one thread per matrix element, whose position is given by DTid.xy. Since the numthreads() arguments are determined at compile time, how can I get the exact number of threads required to process a matrix whose dimensions are not a multiple of the thread group size and not known at compile time?


Solution

  • 1) Yes, there is (or can be) a performance difference, depending on the actual numbers and on the hardware in use!

    GPUs execute threads in groups usually called "waves" (warps on NVIDIA hardware, wavefronts on AMD). These waves work in a SIMD-like fashion: all threads in a wave always execute the same operation at the same time. The exact number of threads per wave is vendor-specific, but it is usually 32 (all NVIDIA GPUs I know of) or 64 (most AMD GPUs).

    A single group of threads can be distributed across multiple waves, but a single wave can only execute threads of the same group. Therefore, if your number of threads per group is not a multiple of the hardware's wave size, some threads in a wave are "idling" (they actually execute the same instructions as the other ones, but aren't allowed to write to memory), so you are "losing" performance that you would get with a better thread count.
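    As a rough illustration of this idle-lane cost (a sketch assuming a wave size of 32; the helper names are mine, not part of any API):

    ```cpp
    #include <cassert>

    // Number of waves needed to run `groupSize` threads on hardware
    // with `waveSize` lanes per wave (ceiling division).
    unsigned wavesNeeded(unsigned groupSize, unsigned waveSize) {
        return (groupSize + waveSize - 1) / waveSize;
    }

    // Lanes that sit idle in the last, partially filled wave.
    unsigned idleLanes(unsigned groupSize, unsigned waveSize) {
        return wavesNeeded(groupSize, waveSize) * waveSize - groupSize;
    }

    int main() {
        // numthreads(50, 1, 1) on 32-wide hardware: 2 waves, 14 idle lanes.
        assert(wavesNeeded(50, 32) == 2);
        assert(idleLanes(50, 32) == 14);
        // numthreads(64, 1, 1): 2 full waves, nothing wasted
        // (and a single full wave on 64-wide AMD hardware).
        assert(idleLanes(64, 32) == 0);
        return 0;
    }
    ```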

  • 2) You would most likely select a thread count that suits your hardware (64 is a good default, as it is also a multiple of 32), and use branching to mark threads that fall outside your matrix as "inactive" (you can pass the size of the matrix/data to the shader in a constant buffer). Since these inactive threads don't do anything at all, the hardware can simply mask them off (similar to how they are handled when the number of threads per group is smaller than the wave size), which is quite cheap. If all threads in a wave are marked inactive, the hardware can even choose to skip that wave's work completely, which would be optimal.

    You could also use padding to make sure that your matrix/data is always a multiple of the number of threads per group, e.g. with zeroes or the identity matrix or whatever fits your application. However, whether this is possible depends on the application, and I would assume that branching is at least as fast in most cases.
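    The branching approach can be sketched from the host side like this (a minimal sketch: the 8×8 group size, the `groupCount` helper, and the example dimensions are my assumptions, not from the answer above):

    ```cpp
    #include <cassert>

    // Ceiling division: smallest number of groups of `groupSize` threads
    // that covers `dim` elements.
    unsigned groupCount(unsigned dim, unsigned groupSize) {
        return (dim + groupSize - 1) / groupSize;
    }

    int main() {
        // Matching HLSL side (assumed): [numthreads(8, 8, 1)] with a guard
        //   if (DTid.x >= DimX || DTid.y >= DimY) return;
        // where DimX/DimY come from a constant buffer.
        const unsigned GroupSizeX = 8, GroupSizeY = 8;

        // Runtime matrix dimensions, e.g. 100 x 60:
        unsigned DimX = 100, DimY = 60;

        // These are the arguments you would pass to
        // context->Dispatch(groupsX, groupsY, 1);
        unsigned groupsX = groupCount(DimX, GroupSizeX); // 13 groups cover 104 >= 100
        unsigned groupsY = groupCount(DimY, GroupSizeY); //  8 groups cover  64 >=  60
        assert(groupsX == 13 && groupsY == 8);
        return 0;
    }
    ```

    The guard in the shader is what keeps the extra threads of the last, partially covered groups from writing out of bounds.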