parallel-processing kernel opencl gpu raytracing

OpenCL ray tracing and branching kernel code issue

I'm trying to implement a ray/path tracer using OpenCL, and it seems pretty straightforward: write a kernel that traces the path of a single ray/pixel/etc. and have it execute on multiple rays in parallel.

However, when traversing a scene, a single ray can take a considerable number of different paths. For instance, depending on the material of the hit object, a ray can either reflect or refract. Additionally, different materials require different shading algorithms. If one scene object requires a Cook-Torrance shader and another requires a Ward anisotropic shader, then different shading functions need to be called within the kernel.

Based on what I've been reading, it is inadvisable to have branching code inside a kernel because it hinders performance. But this seems unavoidable in a ray tracer if I am parallelizing my code per ray.

So is a "branching" code structure really that much of a hindrance for kernel performance? If so, how else would I go about structuring my code to account for this?

Solution

First pass (1M rays): each ray writes its result into an unsigned char array (or even packed single bits), 1 for a surface hit and 0 for render end.

   ray 0     ------------------  render end  -------------->   0     \
   ray 1     ------------------  surface    --------------->   1      \
   ray 2     ------------------  surface    --------------->   1        }-- bad for SIMD
   ray 3     ------------------  render end  -------------->   0      /
   ray 4     ------------------  surface    --------------->   1     /
   ...
   ...
   ray 1M    ...
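
As a rough illustration of that first pass, here is a minimal sketch in plain C rather than OpenCL C, so it can be compiled and checked on the host. Everything in it is an assumption made for the example: the `Ray` and `Sphere` types, the single-sphere "scene", and the names `hits_sphere` and `first_pass`. In a real kernel the loop body would be one work-item and `i` would come from `get_global_id(0)`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types, used only for this illustration. */
typedef struct { float ox, oy, oz, dx, dy, dz; } Ray;   /* origin, unit direction */
typedef struct { float cx, cy, cz, radius; } Sphere;

/* Returns 1 if the ray hits the sphere (surface), 0 otherwise (render end). */
static unsigned char hits_sphere(const Ray *r, const Sphere *s)
{
    float lx = s->cx - r->ox, ly = s->cy - r->oy, lz = s->cz - r->oz;
    float l2 = lx * lx + ly * ly + lz * lz;              /* squared distance to center */
    float r2 = s->radius * s->radius;
    float t  = lx * r->dx + ly * r->dy + lz * r->dz;     /* projection onto the ray */

    if (l2 <= r2)
        return 1;                                        /* origin is inside the sphere */
    return (unsigned char)(t > 0.0f && (l2 - t * t) <= r2);
}

/* First pass: write one flag per ray. */
void first_pass(const Ray *rays, size_t n, const Sphere *scene, unsigned char *flags)
{
    assert(rays != NULL && scene != NULL && flags != NULL);
    for (size_t i = 0; i < n; ++i)       /* one iteration = one work-item on the GPU */
        flags[i] = hits_sphere(&rays[i], scene);
}
```

The flags come out in ray order, so neighbouring work-items will generally disagree about what to do next, which is the "bad for SIMD" situation in the diagram.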

Sort the rays by surface type (exists / non-existent) and by surface position (temporal coherency). Cache or multiplex the sorted order so the refraction and reflection passes can reuse it.

   ray 1  \
   ray 2   -------------------- all surfaces --------------> 1   good for SIMD
   ray 4  /
   ray 0  \
   ray x   -------------------- all render end ------------> 0   good for SIMD
   ray 3  /
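
The simplest form of that sort, by hit flag only, can be sketched as a stable two-bucket partition. This is a sequential C sketch under my own assumptions (the name `partition_by_flag` and the `order` index array are hypothetical). On a GPU the same result would normally come from a parallel prefix sum (scan) over the flags rather than a serial loop, and a second key such as surface position would need a real sort.

```c
#include <assert.h>
#include <stddef.h>

/* Stable two-bucket partition of ray indices:
   rays with flag 1 (surface) first, then rays with flag 0 (render end).
   Returns the number of surface rays, i.e. where the second group starts. */
size_t partition_by_flag(const unsigned char *flags, size_t n, unsigned int *order)
{
    assert(flags != NULL && order != NULL);

    size_t hits = 0;
    for (size_t i = 0; i < n; ++i)       /* count the surface rays */
        hits += (flags[i] != 0);

    size_t next_hit = 0, next_miss = hits;
    for (size_t i = 0; i < n; ++i) {     /* scatter indices into the two groups */
        if (flags[i])
            order[next_hit++] = (unsigned int)i;
        else
            order[next_miss++] = (unsigned int)i;
    }
    return hits;
}
```

With the flags from the first diagram (0, 1, 1, 0, 1 for rays 0 to 4) this gives the order 1, 2, 4, 0, 3 and returns 3, so work-items 0 to 2 all handle surfaces and the rest all handle render end.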

Second pass (refraction) (1M rays)

   ray 1  ..................... refract ...................> cast a new ray  
   ray 2  ..................... refract ...................> cast a new ray
   ray 4  ..................... refract ...................> cast a new ray
   ray 0  .................... no new ray casting .........> offload some other work/draw 
   ray x  .................... no new ray casting .........> offload some other work/draw
   ray 3  .................... no new ray casting .........> offload some other work/draw

Third pass (reflection) (1M rays)

   ray 1  ..................... reflect ...................> cast a new ray  
   ray 2  ..................... reflect ...................> cast a new ray
   ray 4  ..................... reflect ...................> cast a new ray
   ray 0  .................... no new ray casting .........> offload some other work/draw 
   ray x  .................... no new ray casting .........> offload some other work/draw
   ray 3  .................... no new ray casting .........> offload some other work/draw
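
To show what a pass over the sorted rays might look like, here is a hedged C sketch of the reflection pass. The `Vec3` type, the `order` index array (ray indices with all surface hits first), and the function names are assumptions for the example, not part of any OpenCL API. It uses the standard mirror formula r = d - 2 (d . n) n, and the refraction pass would have the same shape with a different per-ray function.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;   /* hypothetical vector type */

/* Mirror direction d about the unit normal n: r = d - 2 (d . n) n. */
static Vec3 reflect(Vec3 d, Vec3 n)
{
    float k = 2.0f * (d.x * n.x + d.y * n.y + d.z * n.z);
    Vec3 r = { d.x - k * n.x, d.y - k * n.y, d.z - k * n.z };
    return r;
}

/* Reflection pass over the sorted order. Entries [0, hits) all hit a
   surface, so every iteration does the same work with no per-ray branch.
   out[k] receives the new direction for the k-th sorted ray. */
void reflection_pass(const unsigned int *order, size_t hits,
                     const Vec3 *dirs, const Vec3 *normals, Vec3 *out)
{
    assert(order != NULL && dirs != NULL && normals != NULL && out != NULL);
    for (size_t k = 0; k < hits; ++k) {   /* one iteration = one work-item */
        unsigned int ray = order[k];
        out[k] = reflect(dirs[ray], normals[ray]);
    }
}
```

The remaining entries of the sorted order (the render-end rays) are simply not launched in this pass, or are given the "other work" mentioned in the diagram.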

Now there are two groups of 1M rays, and the count doubles at each iteration. So if you have space for 256M elements, you should be able to cast rays down to depth 7 or 8. All of this could be done on a single array with proper indexing.
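
The "depth 7 or 8" figure can be checked with a little arithmetic, sketched below in C. The two readings are my interpretation: if only the deepest level is kept, 1M * 2^d must fit in 256M, giving d = 8. If every level stays in the array, 1M * (2^(d+1) - 1) must fit, giving d = 7. The indexing scheme (`level_offset`, `child_slot`) is one possible layout that I am assuming for illustration, with the children of ray i stored at 2i (refracted) and 2i + 1 (reflected) in the next level.

```c
#include <assert.h>
#include <stddef.h>

/* Deepest level d with n_primary * 2^d <= capacity (only last level stored). */
unsigned max_depth_last_level(size_t n_primary, size_t capacity)
{
    assert(n_primary > 0 && n_primary <= capacity);
    unsigned d = 0;
    size_t level = n_primary;
    while (level <= capacity / 2) {       /* next level still fits */
        level *= 2;
        ++d;
    }
    return d;
}

/* Deepest level d with n_primary * (2^(d+1) - 1) <= capacity (all levels stored). */
unsigned max_depth_all_levels(size_t n_primary, size_t capacity)
{
    assert(n_primary > 0 && n_primary <= capacity);
    unsigned d = 0;
    size_t level = n_primary, total = n_primary;
    while (level <= (capacity - total) / 2) {
        level *= 2;
        total += level;
        ++d;
    }
    return d;
}

/* Start of a level inside the single array: n_primary * (2^depth - 1). */
size_t level_offset(size_t n_primary, unsigned depth)
{
    return n_primary * (((size_t)1 << depth) - 1);
}

/* Slot of the child of ray i (index within its level) one level deeper. */
size_t child_slot(size_t n_primary, unsigned depth, size_t i, int reflected)
{
    return level_offset(n_primary, depth + 1) + 2 * i + (reflected ? 1 : 0);
}
```

For example, with 4 primary rays, level 0 occupies slots 0 to 3 and level 1 occupies slots 4 to 11, so the reflected child of primary ray 3 lands in slot 11.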