What I mean by the title is this: I have implemented a ray tracing program. In the program, I divide my arrays into chunks because of memory issues. After each chunk of rays has been traced, I send a new chunk to the OpenCL kernel. But within a chunk, when only one ray is still bouncing and all the other rays are done, the finished rays have to wait for that one ray. This is not very efficient. What I want to do is assign a chunk of rays to each thread from the CPU side; when a thread finishes its chunk, it signals the CPU and I assign that thread another chunk of rays.
Thank you for your advice.
This is not possible. Within a workgroup (here, a chunk of rays), the finished rays always have to wait for the remaining ones.
What you can do, however, is make each chunk compute the rays for a square of 8x8 pixels on the screen, so that there is a high chance that all these rays (which have very similar directions) also do a similar number of intersections. This is faster than making a chunk a stripe of 64x1 pixels, where it is more likely that the number of bounces differs between the left and right ends of the stripe. See my technical talk on this.
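To make the tile-versus-stripe argument concrete, here is a rough toy model in Python (not OpenCL, and not a measurement). It assumes a hypothetical 64x64 image with a reflective disc that needs 8 bounces on a background that needs 1, and it assumes a workgroup costs as much as its slowest ray times the number of rays in it. The names `group_cost`, `total_cost` and the scene itself are made up for illustration.

```python
def group_cost(bounces, x0, y0, w, h):
    """Toy cost of one workgroup: every ray waits for the slowest ray."""
    slowest = max(bounces[y][x]
                  for y in range(y0, y0 + h)
                  for x in range(x0, x0 + w))
    return slowest * w * h


def total_cost(bounces, w, h):
    """Sum the toy cost over all w-by-h workgroups covering the image."""
    height, width = len(bounces), len(bounces[0])
    return sum(group_cost(bounces, x0, y0, w, h)
               for y0 in range(0, height, h)
               for x0 in range(0, width, w))


# Hypothetical scene: a disc of radius 12 (8 bounces) on a 1-bounce background.
WIDTH = HEIGHT = 64
bounces = [[8 if (x - 32) ** 2 + (y - 32) ** 2 < 12 ** 2 else 1
            for x in range(WIDTH)]
           for y in range(HEIGHT)]

tile_cost = total_cost(bounces, 8, 8)     # 64 rays per group, square tiles
stripe_cost = total_cost(bounces, 64, 1)  # 64 rays per group, full-width stripes

print("8x8 tiles:   ", tile_cost)
print("64x1 stripes:", stripe_cost)
```

In this model every stripe that crosses the disc pays the 8-bounce price for all 64 of its rays, while only the tiles that actually touch the disc do, so the tiles come out cheaper. Real hardware is more complicated than this, but the tendency is the same.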
Finally, your kernel range should be the entire image at once, with only a single execution. Don't schedule single workgroups one by one; let the GPU scheduler do this for you.
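As a sketch of what "the entire image at once" means for the launch sizes: the helper below (the name `ndrange_for_image` is hypothetical) computes a 2D global size and an 8x8 local size, the kind of values you would pass as `global_work_size` and `local_work_size` to `clEnqueueNDRangeKernel`. It pads the global size up to a multiple of the tile size, since OpenCL 1.x requires the global size to be evenly divisible by the local size. This assumes the kernel skips work-items that fall outside the image.

```python
def ndrange_for_image(width, height, tile=8):
    """Return (global_size, local_size) for one 2D launch over the whole image."""
    def round_up(n, multiple):
        # Smallest multiple of `multiple` that is >= n.
        return ((n + multiple - 1) // multiple) * multiple

    global_size = (round_up(width, tile), round_up(height, tile))
    local_size = (tile, tile)
    return global_size, local_size


global_size, local_size = ndrange_for_image(1366, 768)
print(global_size, local_size)  # (1368, 768) (8, 8)
```

With sizes like these, a single enqueue covers every pixel and the GPU scheduler hands out the 8x8 workgroups itself as compute units become free.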