opencl, reduction, prefix-sum

OpenCL: parallel reduction without local memory


Most parallel reduction algorithms (from Nvidia, AMD, Intel and so on) use shared (local) memory.

But what if a device doesn't have shared (local) memory?

How can I do the reduction then?

If I use the same algorithm but store the temporary values in global memory, will it work correctly?


Solution

  • Thinking about it, my comment was already the complete answer.

    Yes, you can use global memory as a replacement for local memory but:

    • you have to allocate enough global memory for all work-groups and assign each work-group its own chunk of that memory (with local memory, you only specify the amount needed for a single work-group, and each work-group gets its own allocation of that size)
    • you have to use CLK_GLOBAL_MEM_FENCE instead of CLK_LOCAL_MEM_FENCE in your barrier() calls
    • you will lose a significant amount of performance

    If I have time this evening, I will post a simple example.
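
The points above can be sketched without an OpenCL device. The following is a minimal host-side emulation in plain C, not a kernel and not a definitive implementation: the outer loop stands in for `get_group_id(0)`, the inner loops for `get_local_id(0)`, and comments mark where a real kernel would call `barrier(CLK_GLOBAL_MEM_FENCE)`. The function name `reduce_with_global_scratch` and its parameters are hypothetical, and the sketch assumes `local_size` is a power of two and `n` is a multiple of `local_size`.

```c
#include <stddef.h>

/*
 * Host-side emulation (plain C, no OpenCL runtime) of a work-group tree
 * reduction that keeps its partial sums in a global scratch buffer
 * instead of __local memory.
 *
 * Assumptions for this sketch: local_size is a power of two, n is a
 * multiple of local_size, and scratch holds n floats
 * (num_groups * local_size), i.e. one chunk per work-group.
 */
void reduce_with_global_scratch(const float *input, float *scratch,
                                float *partial_sums,
                                size_t n, size_t local_size)
{
    size_t num_groups = n / local_size;

    /* In a kernel, `group` would be get_group_id(0). */
    for (size_t group = 0; group < num_groups; ++group) {
        /* Each work-group owns scratch[base .. base + local_size - 1]. */
        size_t base = group * local_size;
        float *chunk = scratch + base;

        /* In a kernel, `lid` would be get_local_id(0):
           every work-item copies its element into the group's chunk. */
        for (size_t lid = 0; lid < local_size; ++lid)
            chunk[lid] = input[base + lid];
        /* kernel: barrier(CLK_GLOBAL_MEM_FENCE); */

        /* Tree reduction: halve the number of active work-items each pass. */
        for (size_t stride = local_size / 2; stride > 0; stride /= 2) {
            for (size_t lid = 0; lid < stride; ++lid)
                chunk[lid] += chunk[lid + stride];
            /* kernel: barrier(CLK_GLOBAL_MEM_FENCE); */
        }

        /* In a kernel, work-item 0 would write the group's result. */
        partial_sums[group] = chunk[0];
    }
}
```

Note that the scratch buffer needs `num_groups * local_size` elements, whereas a local-memory version would only declare `local_size` elements per work-group. Since `barrier()` only synchronizes the work-items of one work-group, the per-group results in `partial_sums` still have to be combined in a second pass or on the host.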