
Can dispatch overhead be more expensive than an actual thread work?


Recently I've been thinking about opportunities for aggressive optimization. I mean the kind of optimization where you, for example, manually unroll a 3-iteration loop just to gain a little performance.

So one thought came to my mind. Imagine we have a buffer of 1024 elements, and we want to multiply every single element by 2. We create a simple kernel that takes the input buffer, the output buffer, their size (to check whether we are outside the bounds), and [[thread_position_in_grid]]. The kernel just does a single multiplication and writes the result to the output buffer.

It would look something like this:

kernel void multiplyBy2(constant float* in [[buffer(0)]],
                        device float* out [[buffer(1)]],
                        constant Uniforms& uniforms [[buffer(2)]],
                        uint gid [[thread_position_in_grid]])
{
    if (gid >= uniforms.buffer_size) { return; }
    out[gid] = in[gid] * 2.0;
}

The thing I'm concerned about is whether the actual thread work is still worth the overhead produced by dispatching it.

Would it be more efficient to, for example, dispatch four times fewer threads, each doing something like this:

    out[gid * 4 + 0] = in[gid * 4 + 0] * 2.0;
    out[gid * 4 + 1] = in[gid * 4 + 1] * 2.0;
    out[gid * 4 + 2] = in[gid * 4 + 2] * 2.0;
    out[gid * 4 + 3] = in[gid * 4 + 3] * 2.0;

so that each thread works a little bit longer? Or is it better to make threads as thin as possible?
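For reference, a complete version of that coarser kernel might look like the sketch below. It assumes uniforms.buffer_size is the element count, and bounds-checks each element so a partial final group of 4 stays in bounds:

// Hypothetical sketch: each thread processes 4 consecutive elements.
kernel void multiplyBy2x4(constant float* in [[buffer(0)]],
                          device float* out [[buffer(1)]],
                          constant Uniforms& uniforms [[buffer(2)]],
                          uint gid [[thread_position_in_grid]])
{
    uint base = gid * 4;
    for (uint i = 0; i < 4; ++i) {
        uint idx = base + i;
        // Per-element guard, since buffer_size may not be a multiple of 4.
        if (idx < uniforms.buffer_size) {
            out[idx] = in[idx] * 2.0;
        }
    }
}

The grid for this variant would be dispatched with a quarter as many threads as there are elements, rounded up.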


Solution

  • Yes, and this is true not merely in contrived examples, but in some real-world scenarios too.

    For extremely simple kernels like yours, the dispatch overhead can swamp the work to be done, but there's another factor that may have an even bigger effect on performance: sharing fetched data and intermediate results.

    If you have a kernel that, for example, reads the 3x3 neighborhood of a pixel from an input texture and writes the average to an output texture, you could share the fetched texture data and partial sums between adjacent pixels by operating on more than one pixel in your kernel function and reducing the total number of threads you dispatch.
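As a sketch of that idea, here is a hypothetical kernel that writes two horizontally adjacent output pixels per thread. The two 3x3 windows overlap in 6 texels, so the thread fetches 12 texels instead of 18 and reuses the overlapping column sums (border handling is omitted for brevity; it assumes the grid covers only interior pixels):

// Hypothetical sketch: one thread averages the 3x3 neighborhoods of
// TWO adjacent pixels, sharing the fetches and partial column sums.
kernel void boxAverage2(texture2d<float, access::read> inTex [[texture(0)]],
                        texture2d<float, access::write> outTex [[texture(1)]],
                        uint2 gid [[thread_position_in_grid]])
{
    uint2 base = uint2(gid.x * 2, gid.y);   // left pixel of the pair
    // Column sums for columns base.x-1 .. base.x+2 (no edge clamping here).
    float4 sums[4];
    for (uint i = 0; i < 4; ++i) {
        sums[i] = float4(0.0);
        for (uint j = 0; j < 3; ++j) {
            sums[i] += inTex.read(uint2(base.x + i - 1, base.y + j - 1));
        }
    }
    // Each output reuses two of the four column sums.
    outTex.write((sums[0] + sums[1] + sums[2]) / 9.0, base);
    outTex.write((sums[1] + sums[2] + sums[3]) / 9.0, uint2(base.x + 1, base.y));
}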

    Perhaps this sates your curiosity. For any practical application, Scott Hunter is right that you should profile on all target devices before and after optimizing.