Tags: macos, gpgpu, metal, compute-shader

How do I reliably query SIMD group size for Metal Compute Shaders? threadExecutionWidth doesn't always match


I'm trying to use the SIMD group reduction/prefix functions in a series of reasonably complex compute kernels in a Mac app. I need to allocate some threadgroup memory for coordinating between SIMD groups in the same threadgroup. This array should therefore have a capacity that depends on [[simdgroups_per_threadgroup]], but that's not a compile-time value, so it can't be used as an array dimension.
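For concreteness, this is roughly the shape of kernel I mean (a cut-down sketch with made-up names): the per-SIMD-group staging array has to come in as a threadgroup buffer argument precisely because its size isn't known at compile time.

```metal
#include <metal_stdlib>
using namespace metal;

// Illustrative kernel: each SIMD group reduces its own values, then lane 0 of
// each group stages its partial result in threadgroup memory so the groups can
// be combined. `partials` needs one slot per SIMD group, but
// simdgroups_per_threadgroup isn't a compile-time constant, so the array is
// bound as a threadgroup buffer whose size the host chooses.
kernel void partial_sums(device const float* input    [[buffer(0)]],
                         device float*       output   [[buffer(1)]],
                         threadgroup float*  partials [[threadgroup(0)]],
                         uint gid        [[thread_position_in_grid]],
                         uint group_id   [[threadgroup_position_in_grid]],
                         uint lane       [[thread_index_in_simdgroup]],
                         uint simd_id    [[simdgroup_index_in_threadgroup]],
                         uint simd_count [[simdgroups_per_threadgroup]])
{
    float partial = simd_sum(input[gid]);   // per-SIMD-group reduction
    if (lane == 0)
        partials[simd_id] = partial;        // one slot per SIMD group
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // First thread of the threadgroup combines the per-SIMD-group results.
    if (simd_id == 0 && lane == 0) {
        float total = 0.0f;
        for (uint i = 0; i < simd_count; ++i)
            total += partials[i];
        output[group_id] = total;
    }
}
```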

Now, according to various WWDC session videos, threadExecutionWidth on the pipeline object should return the SIMD group size, with which I could then allocate an appropriate amount of memory using setThreadgroupMemoryLength:atIndex: on the compute encoder.
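On the host side, that suggested approach looks roughly like this (a sketch; `device`, `kernelFunction`, `encoder` and `threadgroupCount` are assumed to exist already):

```swift
import Metal

// Size the per-SIMD-group scratch array from threadExecutionWidth, on the
// assumption that it really is the SIMD group size (the assumption that
// breaks down below).
let pipeline = try device.makeComputePipelineState(function: kernelFunction)

let threadsPerGroup = pipeline.maxTotalThreadsPerThreadgroup
let simdWidth       = pipeline.threadExecutionWidth   // supposedly the SIMD group size
let simdGroupCount  = (threadsPerGroup + simdWidth - 1) / simdWidth

// One float slot per SIMD group, rounded up: threadgroup memory lengths
// must be a multiple of 16 bytes.
let length = ((simdGroupCount * MemoryLayout<Float>.stride) + 15) & ~15

encoder.setComputePipelineState(pipeline)
encoder.setThreadgroupMemoryLength(length, index: 0)
encoder.dispatchThreadgroups(MTLSize(width: threadgroupCount, height: 1, depth: 1),
                             threadsPerThreadgroup: MTLSize(width: threadsPerGroup,
                                                            height: 1, depth: 1))
```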

This works consistently on some hardware (e.g. Apple M1, threadExecutionWidth always seems to report 32) but I'm hitting configurations where threadExecutionWidth does not match apparent SIMD group size, causing runtime errors due to out of bounds access. (e.g. on Intel UHD Graphics 630, threadExecutionWidth = 16 for some complex kernels, although SIMD group size seems to be 32)

So:

  1. Is there a reliable way to query SIMD group size for a compute kernel before it runs?
  2. Alternatively, will the SIMD group size always be the same for all kernels on a device?

If the latter is at least true, I can presumably trust threadExecutionWidth for the most trivial of kernels? Or should I submit a trivial kernel to the GPU which returns [[threads_per_simdgroup]]?
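The trivial kernel I have in mind would be something like this (illustrative sketch), dispatched once with a single thread so the host can read the width back:

```metal
#include <metal_stdlib>
using namespace metal;

// Probe kernel: writes the device's reported SIMD group width to a buffer.
kernel void probe_simd_width(device uint& simd_width [[buffer(0)]],
                             uint width [[threads_per_simdgroup]])
{
    simd_width = width;
}
```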

I suspect the problem might occur in kernels where Metal reports an "odd" (non-power-of-two) maximum threadgroup size, although in the case I'm encountering, the maximum threadgroup size is reported as 896, which is an integer multiple of 32, so it's not as if threadExecutionWidth is just the greatest common divisor of the maximum threadgroup size and the SIMD group size.


Solution

  • I never found a particularly satisfying solution to this, but I did at least find an effective one:

    1. Pass the expected SIMD group size as a kernel argument, and use it as the basis for allocating buffer sizes. Start it off as threadExecutionWidth.
    2. As the first part of the compute kernel, compare this to the actual value of [[threads_per_simdgroup]]. If it matches, great, run the rest of the kernel.
    3. If it doesn't match, write the actual SIMD group size to a feedback/error-reporting field in a device memory buffer, then early-out of the compute kernel.
    4. On the host side, check whether the compute kernel exited early via the status in device memory. If so, inspect the reported SIMD group size, adjust buffer allocations, and re-run the kernel with the new value. (A sketch of both halves follows below.)
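
    In code, the kernel side of steps 1-3 looks roughly like this (illustrative sketch; Params, Feedback and my_kernel are made-up names, not any API):

    ```metal
    #include <metal_stdlib>
    using namespace metal;

    struct Params   { uint assumed_simd_width; };  // the width the host sized buffers for
    struct Feedback { uint actual_simd_width; };   // 0 means the assumption held

    kernel void my_kernel(constant Params&   params   [[buffer(0)]],
                          device   Feedback& feedback [[buffer(1)]],
                          threadgroup float* partials [[threadgroup(0)]],
                          uint simd_width [[threads_per_simdgroup]],
                          uint gid        [[thread_position_in_grid]]
                          /* ...other arguments... */)
    {
        if (simd_width != params.assumed_simd_width)
        {
            // Report the real width and bail out before touching the
            // (possibly undersized) threadgroup allocation.
            if (gid == 0)
                feedback.actual_simd_width = simd_width;
            return;
        }
        // ...rest of the kernel; partials is now known to be big enough...
    }
    ```

    And the host side of step 4, where runKernel(assumedSimdWidth:) is a made-up helper that sizes the threadgroup memory for that width, encodes and commits the dispatch, waits, and returns the Feedback buffer's contents:

    ```swift
    var assumedWidth = UInt32(pipeline.threadExecutionWidth)
    while true {
        let feedback = runKernel(assumedSimdWidth: assumedWidth)
        if feedback.actualSimdWidth == 0 {
            break                                   // assumption held; results are valid
        }
        assumedWidth = feedback.actualSimdWidth     // resize buffers and try again
    }
    ```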

    For the truly paranoid, it may be wise to make the check in step 2 a lower or upper bound, or perhaps a range, rather than an equality check: e.g., the allocated memory is safe for SIMD group sizes up to (or down to) N threads. That way, if the threadgroup buffer allocations should themselves change the SIMD group size (😱) you don't end up bouncing back and forth between values, making no progress.

    Also pay attention to what you do in SIMD groups: not all GPU models support SIMD group reduction functions, even if they support SIMD permutations, so ship alternate versions of kernels for such older GPUs if necessary.

    Finally, I've found most GPUs report SIMD group sizes of 32 threads, but the Intel Iris Graphics 6100 from ~2015 MacBook Pros reports a threads_per_simdgroup (and threadExecutionWidth) value of 8. (And it doesn't support SIMD reduction functions, but it does support SIMD permutation functions, including simd_ballot(), which can be almost as effective as reductions for some algorithms.)
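
    For example, a count that would otherwise be a simd_sum() of a 0/1 value can be had from a ballot instead; a minimal illustrative kernel (the predicate and buffer names are made up):

    ```metal
    #include <metal_stdlib>
    using namespace metal;

    // Illustrative only: on GPUs without simd_sum() and friends, simd_ballot()
    // plus popcount can stand in for a boolean reduction. Here each SIMD group
    // counts how many of its lanes satisfy a predicate (assumes all lanes are
    // active, i.e. the grid size is a multiple of the SIMD group size).
    kernel void count_votes(device const float* distances  [[buffer(0)]],
                            constant float&     threshold  [[buffer(1)]],
                            device uint*        lane_votes [[buffer(2)]],
                            uint gid [[thread_position_in_grid]])
    {
        bool keep = distances[gid] < threshold;              // made-up per-lane predicate
        simd_vote ballot = simd_ballot(keep);                // one bit per lane in the group
        simd_vote::vote_t mask = static_cast<simd_vote::vote_t>(ballot);
        // Count set bits; split into 32-bit halves so plain popcount(uint) suffices.
        uint votes = popcount(uint(mask)) + popcount(uint(mask >> 32));
        lane_votes[gid] = votes;                             // every lane sees the same count
    }
    ```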