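I have an application (which I did not write myself) that requires three parameters: the number of blocks, the number of threads per block, and the number of calculations per thread.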
It uses OpenCL and I have an RX 580. My current efficiency is low.
The GPU has 2304 stream processors in 36 compute units.
I have played around with different values, but I have no idea what the best starting point would be, because I don't know how blocks and threads relate to the compute units. Any help in understanding how to choose the number of blocks, the number of threads per block, and the number of calculations per thread would be greatly appreciated.
Thank you so much
I'm going to make the same assumptions you have:
Blocks: Number of workgroups
Thread: Number of threads
Points: Some metric of work per thread
It's more important to set the correct workgroup size than the number of workgroups. You will want the group size to be at least the SIMD width, which is 32 on NVIDIA GPUs and 64 on AMD GCN cards such as the RX 580. So Blocks should be set to Threads / 64 on your card (Threads / 32 on a 32-wide GPU).
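To make the arithmetic concrete, here is a minimal sketch of deriving the number of workgroups from a thread count. The function name `launch_config` and its parameters are my own, not something the application exposes. The SIMD width is passed in because it varies by vendor (AMD GCN cards such as the RX 580 run 64-wide wavefronts, NVIDIA warps are 32 wide).

```python
def launch_config(total_threads, simd_width):
    """Return (blocks, threads_per_block, padded_total).

    Hypothetical helper: uses one SIMD width per workgroup and rounds the
    thread count up so that no workgroup is only partly filled.
    """
    if total_threads <= 0 or simd_width <= 0:
        raise ValueError("total_threads and simd_width must be positive")
    # Ceiling division: number of workgroups needed to cover every thread.
    blocks = (total_threads + simd_width - 1) // simd_width
    return blocks, simd_width, blocks * simd_width


print(launch_config(2304, 64))  # (36, 64, 2304)
print(launch_config(2304, 32))  # (72, 32, 2304)
print(launch_config(1000, 32))  # (32, 32, 1024), padded up from 1000
```

Larger group sizes that are a multiple of the SIMD width (128, 256) are also worth trying. This sketch only gives the smallest sensible one.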
For "Points", this will depend on how much work is done per "calc". There is overhead in kicking off a workgroup, so you want to make sure each thread has enough work to do. From experience, ~16 instructions is usually enough, but if you can't see the kernel code you will just have to experiment.
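If you do end up experimenting, a small sweep keeps it systematic. This is only a sketch: `run` is a hypothetical callable that you would write yourself to launch the application with a given "Points" value and return how many calculations it completed. `fake_run` is a stand-in with a fixed launch overhead so the example runs without a GPU.

```python
import time


def sweep_points(run, candidates):
    """Measure throughput (calculations per second) for each Points value.

    `run(points)` is a hypothetical callable that launches the workload and
    returns the number of calculations it completed.
    """
    throughput = {}
    for points in candidates:
        start = time.perf_counter()
        done = run(points)
        elapsed = max(time.perf_counter() - start, 1e-9)  # avoid divide by zero
        throughput[points] = done / elapsed
    best = max(throughput, key=throughput.get)
    return best, throughput


def fake_run(points):
    """Stand-in workload: fixed 20 ms launch overhead, work itself is free."""
    time.sleep(0.02)
    return points


best, throughput = sweep_points(fake_run, [4, 16, 64])
print("best Points value:", best)
```

With a real kernel the curve usually flattens once the launch overhead is amortised, so pick the smallest value near the plateau rather than the largest one you tested.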
In summary:
All of this assumes you have at least 2304 work items; otherwise you are not fully utilising your hardware.
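A quick sanity check of that last condition can be sketched as below. The helper `utilisation` is my own name. The defaults are the RX 580 figures from the question, and the per-compute-unit number is a rough average that assumes workgroups are spread evenly.

```python
def utilisation(blocks, threads_per_block,
                stream_processors=2304, compute_units=36):
    """Rough check of whether a launch has enough work items for the GPU.

    Hypothetical helper: the defaults are the RX 580 figures from the
    question; an even spread of work across compute units is assumed.
    """
    work_items = blocks * threads_per_block
    return {
        "work_items": work_items,
        "fills_gpu": work_items >= stream_processors,
        "work_items_per_cu": work_items / compute_units,
    }


print(utilisation(36, 64))  # 2304 work items, fills_gpu is True
print(utilisation(16, 64))  # 1024 work items, fills_gpu is False
```

In practice you usually want several times the minimum so the GPU has other work to switch to while some threads wait on memory.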