Balancing blocks, threads and workgroups?

I have an application (which I did not write myself) that requires three parameters:

Blocks
Threads
Points (number of calcs per thread I'm assuming)

It uses OpenCL and I have an RX 580. My current efficiency is low.

The GPU has 2304 stream processors in 36 compute units.

I have played around with different values, but I have no idea what the best starting point would be, because I don't know how blocks and threads relate to the compute units. Any help in understanding how to choose the number of blocks, the number of threads per block and the number of calculations per thread would be greatly appreciated.

Thank you so much

Solution

I'm going to make the same assumptions you have:

Blocks: Number of workgroups
Threads: Number of threads
Points: Some metric of work per thread

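It's more important to set the correct workgroup size than the number of workgroups. You will want the group size to be at least the SIMD width, and ideally a multiple of it. That width is usually 32 on most GPUs, but AMD GCN cards such as your RX 580 use 64-wide wavefronts. So Blocks should be set to Threads / 32 (Threads / 64 on the RX 580).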
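As a rough sketch of that arithmetic, assuming "Threads" is the total thread count and each workgroup is exactly one SIMD width wide: the helper names round_up and blocks_for are made up for illustration and are not part of the application or of OpenCL. The default of 64 assumes an AMD GCN wavefront; pass 32 for hardware with a 32-wide SIMD.

```python
def round_up(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def blocks_for(total_threads, group_size=64):
    """Number of workgroups needed to cover total_threads.

    group_size=64 is an assumption (AMD GCN wavefront width).
    The thread count is padded so that every group is full.
    """
    padded_threads = round_up(total_threads, group_size)
    return padded_threads // group_size


print(blocks_for(2304))      # 36 groups of 64 threads
print(blocks_for(2304, 32))  # 72 groups of 32 threads
```

With 2304 threads and a group size of 64 this gives 36 workgroups, which happens to be one per compute unit on the RX 580.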

For "Points", this will depend on how much work is done per "calc". There is overhead in kicking off a workgroup, so you want to make sure each thread has enough work to do. From experience, ~16 instructions is usually enough. But if you can't see the kernel code, you will just have to experiment.
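If you do end up experimenting, one simple way to organise it is to double "Points" each run and note how many threads that leaves for a fixed amount of work. The sketch below only generates the candidate values; it assumes the total number of calculations is known, and the name sweep and the example total of 1,000,000 are hypothetical. Timing each candidate in the real application is left to you.

```python
def sweep(total_calcs, max_points=64):
    """Yield (points, threads) pairs with Points doubling from 1."""
    points = 1
    while points <= max_points:
        threads = -(-total_calcs // points)  # ceiling division
        yield points, threads
        points *= 2


# Example total is an assumption, not a value from the application.
for points, threads in sweep(1_000_000):
    print(f"Points={points:3d} -> Threads={threads}")
```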

In summary:

Set "Points" so that you have at least 2304 threads for the work you need
Set Blocks to Threads / 32 (Threads / 64 on the RX 580, whose wavefronts are 64 wide)

All of this assumes you have at least 2304 work items; otherwise you are not fully utilising your hardware.
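Putting the summary together, a possible starting point could be computed as below. This is only a sketch under assumptions: that the total number of calculations is known, that "Threads" is the total thread count, and that the group size is one 64-wide wavefront. The function name starting_config is hypothetical. It picks the largest "Points" that still leaves at least 2304 threads, so treat the result as an upper bound on "Points" and try smaller values (more threads) from there.

```python
def starting_config(total_calcs, lanes=2304, group_size=64):
    """Suggest Blocks/Threads/Points for a given amount of work.

    lanes=2304 and group_size=64 are assumptions for an RX 580.
    """
    # Largest Points that still leaves at least `lanes` threads.
    # If total_calcs < lanes, Points is 1 and the GPU is underused.
    points = max(1, total_calcs // lanes)
    threads = -(-total_calcs // points)               # ceiling division
    threads = -(-threads // group_size) * group_size  # pad to full groups
    blocks = threads // group_size
    return {"Blocks": blocks, "Threads": threads, "Points": points}


print(starting_config(2304 * 16))
# {'Blocks': 36, 'Threads': 2304, 'Points': 16}
```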