Most algorithms for parallel reduction use shared (local) memory.
That is the case for the examples from Nvidia, AMD, Intel, and so on.
But what if a device doesn't have shared (local) memory?
How can I do the reduction then?
If I use the same algorithm but store the temporary values in global memory, will it work fine?
Thinking about it, my comment was already the complete answer.
Yes, you can use global memory as a replacement for local memory but:
If I have time this evening, I will post a simple example.
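
For illustration, here is a minimal sketch of the idea from the question, written in CUDA: the usual tree reduction, but with the partial sums kept in a global scratch buffer instead of `__shared__` memory. The names `reduceGlobal`, `scratch`, `blockSums` and `reduceOnes` are made up for this sketch, and it assumes the block size is a power of two. It is not a tuned implementation, just one way this could look.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Tree reduction (sum) that stores partial results in global memory.
// Assumption: blockDim.x is a power of two.
__global__ void reduceGlobal(const float *in, float *scratch,
                             float *blockSums, int n)
{
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Each block gets its own slice of the global scratch buffer,
    // which plays the role that shared memory normally would.
    float *s = scratch + blockIdx.x * blockDim.x;
    s[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();  // makes the global writes visible within the block

    // Halve the number of active threads in every step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0)
        blockSums[blockIdx.x] = s[0];
}

// Hypothetical host helper: sums n ones on the device, then adds up
// the per-block results on the host.
double reduceOnes(int n)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *in, *scratch, *blockSums;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&scratch, (size_t)blocks * threads * sizeof(float));
    cudaMallocManaged(&blockSums, blocks * sizeof(float));
    for (int i = 0; i < n; ++i)
        in[i] = 1.0f;

    reduceGlobal<<<blocks, threads>>>(in, scratch, blockSums, n);
    cudaDeviceSynchronize();

    double total = 0.0;
    for (int b = 0; b < blocks; ++b)
        total += blockSums[b];

    cudaFree(in);
    cudaFree(scratch);
    cudaFree(blockSums);
    return total;
}

int main()
{
    printf("sum = %.0f\n", reduceOnes(1 << 20));
    return 0;
}
```

In CUDA, `__syncthreads()` also orders global memory accesses within a block, so the loop is the same as in the shared memory version. In OpenCL the corresponding barrier would need the `CLK_GLOBAL_MEM_FENCE` flag rather than `CLK_LOCAL_MEM_FENCE`. Global memory is slower than shared (local) memory, so this variant will usually run slower than the textbook one.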