So I'm trying to implement a ray/path tracer using OpenCL, and it seems pretty straightforward: write a kernel that traces the path of a single ray (or pixel) and have it execute on many rays in parallel.
However, when traversing a scene, a single ray can take a considerable number of different paths. For instance, depending on the material of the hit object, a ray can either reflect or refract. Additionally, different materials require different shading algorithms. If one scene object requires a Cook-Torrance shader and another requires a Ward anisotropic shader, then different shading functions would need to be called within the kernel.
Based on what I've been reading, it is inadvisable to have a kernel with branching code inside of it because it hurts performance. But this seems unavoidable in a ray tracer if I am parallelizing my code per ray.
So is a "branching" code structure really that much of a hindrance for kernel performance? If so, how else would I go about structuring my code to account for this?
First pass (1M rays): write one hit flag per ray into an unsigned char array (or even packed single bits)
ray 0 ------------------ render end --------------> 0 \
ray 1 ------------------ surface ---------------> 1 \
ray 2 ------------------ surface ---------------> 1 }-- bad for SIMD
ray 3 ------------------ render end --------------> 0 /
ray 4 ------------------ surface ---------------> 1 /
...
...
ray 1M ...
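The "packed single bits" variant of the flag array could look roughly like the sketch below. It is plain C on the host side rather than an OpenCL kernel, and the helper names (`set_hit`, `get_hit`, `flag_bytes`) are made up for illustration. Note that on a GPU, several work-items writing different bits of the same byte would race unless atomics are used, which is one reason the plain unsigned char array is the simpler choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: one bit per ray, 1 = hit a surface, 0 = render end. */

/* Number of bytes needed to store one flag bit per ray. */
static size_t flag_bytes(size_t ray_count)
{
    return (ray_count + 7) / 8;
}

/* Set or clear the flag bit of a single ray. */
static void set_hit(uint8_t *bits, size_t ray, int hit)
{
    uint8_t mask = (uint8_t)(1u << (ray % 8));
    if (hit)
        bits[ray / 8] |= mask;
    else
        bits[ray / 8] &= (uint8_t)~mask;
}

/* Read the flag bit of a single ray (returns 0 or 1). */
static int get_hit(const uint8_t *bits, size_t ray)
{
    return (bits[ray / 8] >> (ray % 8)) & 1;
}
```

For 1M rays this needs 125,000 bytes instead of 1,000,000, at the cost of a shift and a mask per access.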
Sorting (cache or multiplex this for reuse in the refraction and reflection passes) by surface type (exists / non-existent) and surface position (temporal coherency)
ray 1 \
ray 2 -------------------- all surfaces --------------> 1 good for SIMD
ray 4 /
ray 0 \
ray x -------------------- all render end ------------> 0 good for SIMD
ray 3 /
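As a rough illustration of the sorting step, here is a sequential C sketch that groups ray indices by their hit flag while keeping the original order inside each group. The function name `partition_rays` and the `order` index array are assumptions for this example. On the GPU the same result would normally come from a parallel prefix sum (scan) or a sort keyed on the flag or material ID, not from a loop like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Stable partition of ray indices by hit flag (hypothetical helper).
 * hit[i] is nonzero if ray i reached a surface, 0 if it ended.
 * order[] receives the surface rays first, then the ended rays.
 * Returns the number of surface rays.
 */
static size_t partition_rays(const uint8_t *hit, size_t n, uint32_t *order)
{
    /* First sweep: count the surface rays so we know where group 0 starts. */
    size_t hits = 0;
    for (size_t i = 0; i < n; ++i)
        hits += (hit[i] != 0);

    /* Second sweep: scatter each index to the next free slot of its group. */
    size_t next_hit = 0;      /* write cursor for surface rays */
    size_t next_miss = hits;  /* write cursor for ended rays */
    for (size_t i = 0; i < n; ++i) {
        if (hit[i])
            order[next_hit++] = (uint32_t)i;
        else
            order[next_miss++] = (uint32_t)i;
    }
    return hits;
}
```

With the flags from the first-pass diagram (rays 0..4 giving 0, 1, 1, 0, 1), `order` comes out as 1, 2, 4, 0, 3, so neighbouring work-items process rays that take the same branch.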
Second pass (refraction) (1M rays)
ray 1 ..................... refract ...................> cast a new ray
ray 2 ..................... refract ...................> cast a new ray
ray 4 ..................... refract ...................> cast a new ray
ray 0 .................... no new ray casting .........> offload some other work/draw
ray x .................... no new ray casting .........> offload some other work/draw
ray 3 .................... no new ray casting .........> offload some other work/draw
Third pass (reflection) (1M rays)
ray 1 ..................... reflect...................> cast a new ray
ray 2 ..................... reflect...................> cast a new ray
ray 4 ..................... reflect...................> cast a new ray
ray 0 .................... no new ray casting .........> offload some other work/draw
ray x .................... no new ray casting .........> offload some other work/draw
ray 3 .................... no new ray casting .........> offload some other work/draw
Now there are two groups of 1M rays, and the count doubles at each iteration. So if you have space for 256M elements, you should be able to cast rays until depth 7 or 8. All of this can be done on a single array with proper indexing.
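One possible way to do the "proper indexing" is sketched below in plain C. The layout is my own assumption, not the only option: all generations are stored back to back, generation `depth` holds `n << depth` rays, and within the next generation the refracted rays come first and the reflected rays second. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of rays in generation `depth`, given n primary rays. */
static size_t gen_size(size_t n, unsigned depth)
{
    return n << depth;
}

/* Index of the first ray of generation `depth`: n * (2^depth - 1). */
static size_t gen_offset(size_t n, unsigned depth)
{
    return n * (((size_t)1 << depth) - 1);
}

/* Slot of the refracted child of local ray i in generation `depth`. */
static size_t refract_child(size_t n, unsigned depth, size_t i)
{
    assert(i < gen_size(n, depth));
    return gen_offset(n, depth + 1) + i;
}

/* Slot of the reflected child: second half of the next generation. */
static size_t reflect_child(size_t n, unsigned depth, size_t i)
{
    assert(i < gen_size(n, depth));
    return gen_offset(n, depth + 1) + gen_size(n, depth) + i;
}

/* Deepest generation that fits when every generation is kept in memory.
 * Returns -1 if not even the primary rays fit. */
static int max_depth(size_t n, size_t capacity)
{
    assert(n > 0);
    int depth = -1;
    while (gen_offset(n, (unsigned)(depth + 2)) <= capacity)
        ++depth;
    return depth;
}
```

With n = 1M and room for 256M elements, keeping every generation fits through depth 7 (255M rays in total). If only the newest generation is kept, the 256M rays of depth 8 fit on their own, which would account for the "7 or 8".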