Tags: python, performance, parallel-processing, cuda, scientific-computing

Need an efficient way to plot planes from large sets of 3D coordinates


I have some detector data collected on a 2D camera. I convert it to a lab frame, so I end up with an (x^2+y^2) and z coordinate for each pixel in the image. The object then rotates about its normal, and there is an image for each rotation. I apply a rotation matrix to (x^2+y^2) to get x and y matrices for each image, so for each image/angle I end up with something like the table below, where every pixel has a 3D position and an intensity.

z                  x            y          img
444444444     123456789     123456789    123456789                  
333333333     123456789     123456789    423466789
222222222     123456789     123456789    223256789
111111111     123456789     123456789    523456689

What I want to do then is extract a plane, i.e. plot a map of x, y for a given z range.

The problem is slightly more complicated by the following:

The lab frame is actually curved, so I can't rely on each row of x and y being the same. Each image is around 2048x2048 pixels at 32 bits (TIFF), and there can be 1000 images.

My current solution uses CUDA/Numba: I have a function that calculates z, x, y, img for a given angle, and I do that for all the angles. Each time I then slice a number of rows and extend a list with the raveled x, y, img values, and finally use scipy.interpolate.griddata to produce a 2D map. griddata is also pretty slow, so anything on the GPU would probably be better.
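Schematically, the slicing + griddata step is something like this (just a sketch to show the shape of the problem; the function name plane_map, the array names and the 500x500 output grid are placeholders, not my real code):

import numpy as np
from scipy.interpolate import griddata

def plane_map(x_all, y_all, z_all, img_all, z_min, z_max, n=500):
    # keep only the pixels whose z coordinate falls inside the requested slab
    mask = (z_all >= z_min) & (z_all < z_max)
    points = np.column_stack((x_all[mask], y_all[mask]))
    values = img_all[mask]

    # regular x/y grid to interpolate onto
    xi = np.linspace(points[:, 0].min(), points[:, 0].max(), n)
    yi = np.linspace(points[:, 1].min(), points[:, 1].max(), n)
    XI, YI = np.meshgrid(xi, yi)

    # this call dominates the run time for ~2048*2048*1000 points
    return griddata(points, values, (XI, YI), method="linear")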

The whole process is quite slow, so I'm looking for better solutions, or maybe a library that already does this? The CUDA code looks something like this; it isn't particularly slow in itself:

# constants defined at module level: q0, angi, rot_direction, SDD, k0, Binv,
# pixel_x, pixel_y, pixel_count_x, pixel_count_y
import math
import numpy as np
from numba import cuda

@cuda.jit
def detector_to_hkl_kernel(h_glob, k_glob, l_glob, omega_rad):
    # get the current thread position
    j, i = cuda.grid(2)

    if j < h_glob.shape[0] and i < h_glob.shape[1]:
        delta_z = (q0[1] - j) * pixel_y  # real-space distance from centre pixel, y
        delta_x = (i - q0[0]) * pixel_x  # real-space distance from centre pixel, x
        delR = math.sqrt(delta_x**2 + delta_z**2)
        dist = math.sqrt(delta_x**2 + SDD**2 + delta_z**2)  # distance to pixel

        # lab coordinates of the pixel in azimuthal angles
        del_pix = math.atan(delta_x / SDD)
        gam_pix = math.atan(delta_z / math.sqrt(delta_x**2 + SDD**2)) - angi * math.cos(del_pix)

        # lab coordinates in momentum transfer
        qx = k0 * (math.cos(gam_pix) * math.cos(del_pix) - math.cos(angi))
        qy = k0 * (math.cos(gam_pix) * math.sin(del_pix))
        qz = k0 * (math.sin(gam_pix) + math.sin(angi))

        so = math.sin(rot_direction * omega_rad)
        co = math.cos(rot_direction * omega_rad)
        # we deal with the angle of incidence in the momentum transfer calc
        # so that part of the rotation matrix can be fixed
        ci = 1  # math.cos(angi)
        si = 0  # math.sin(angi)

        # rotation matrix
        hphi_1 = so * (ci * qy + si * qz) + co * qx
        hphi_2 = co * (ci * qy + si * qz) - so * qx
        hphi_3 = ci * qz - si * qy

        # H = Binv dot Hphi, computed manually
        h_glob[j, i] = Binv[0][0] * hphi_1 + Binv[0][1] * hphi_2 + Binv[0][2] * hphi_3
        k_glob[j, i] = Binv[1][0] * hphi_1 + Binv[1][1] * hphi_2 + Binv[1][2] * hphi_3
        l_glob[j, i] = Binv[2][0] * hphi_1 + Binv[2][1] * hphi_2 + Binv[2][2] * hphi_3


def hkl_for_angle(omega_rad):  # placeholder name for the per-angle host function
    h_global_mem = cuda.to_device(np.zeros((pixel_count_y, pixel_count_x)))
    k_global_mem = cuda.to_device(np.zeros((pixel_count_y, pixel_count_x)))
    l_global_mem = cuda.to_device(np.zeros((pixel_count_y, pixel_count_x)))

    # configure the blocks
    threadsperblock = (16, 16)
    blockspergrid_x = int(math.ceil(pixel_count_y / threadsperblock[0]))
    blockspergrid_y = int(math.ceil(pixel_count_x / threadsperblock[1]))
    blockspergrid = (blockspergrid_x, blockspergrid_y)

    detector_to_hkl_kernel[blockspergrid, threadsperblock](h_global_mem, k_global_mem, l_global_mem, omega_rad)
    return [h_global_mem.copy_to_host(), k_global_mem.copy_to_host(), l_global_mem.copy_to_host()]

Solution

  • First of all, note that you use double precision here, and mainstream mid-range consumer GPUs are very slow at double-precision floating-point arithmetic. A GTX 1660 Super, for example, delivers 5027 GFlops in single precision but only 157 GFlops in double precision (32 times slower). One simple solution is to switch to single-precision floating-point numbers, by specifying dtype=np.float32 or converting arrays with array.astype(np.float32). A more expensive alternative is a professional GPU with good double-precision throughput, if you cannot use single or mixed precision. (A sketch combining this point with the next two is given at the end of this answer.)

    Moreover, several expressions can be pre-computed ahead of time and stored in constants, for example math.cos(angi), math.sin(angi) and 1.0/SDD. Other sub-expressions can be stored in temporary variables, since the compiler may not be able to factorize the code efficiently (mainly because of the trigonometric functions).

    Additionally, trigonometric functions are often very expensive, especially when the computation must be IEEE-754 compliant (which is likely the case for the math.xxx calls). You can use approximations instead: CUDA provides the __cosf, __sinf and __tanf intrinsics for this, which should be much faster (but check the accuracy of the results if you use them). I am not sure you can call them directly from Numba, but adding the parameter fastmath=True to the @cuda.jit decorator should let the compiler do that for you.

    I think a 2D thread block of 32x8 may be a bit faster, because threads are packed in warps of 32 threads and the GPU schedules work at warp granularity. But the best approach is to benchmark many different block sizes (see the timing sketch at the end of this answer).

    If all of this is not enough, you could try to use shared memory to reduce the number of instructions executed per block, since some expressions are recomputed multiple times per block (see the shared-memory sketch at the end of this answer).
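To make the first three points concrete, here is a rough sketch of what the changes could look like. It is only an illustration built on the names from the question; cos_angi, sin_angi, inv_SDD and Binv32 are new hoisted constants, and the kernel body is elided except for the lines that change:

import math
import numpy as np
from numba import cuda

# hoisted single-precision constants, computed once on the host
cos_angi = np.float32(math.cos(angi))
sin_angi = np.float32(math.sin(angi))
inv_SDD  = np.float32(1.0 / SDD)
Binv32   = Binv.astype(np.float32)

# fastmath=True lets the compiler use the fast approximate device
# intrinsics (__sinf/__cosf/...) instead of the IEEE-754 compliant ones
@cuda.jit(fastmath=True)
def detector_to_hkl_kernel(h_glob, k_glob, l_glob, omega_rad):
    j, i = cuda.grid(2)
    if j < h_glob.shape[0] and i < h_glob.shape[1]:
        delta_z = (q0[1] - j) * pixel_y
        delta_x = (i - q0[0]) * pixel_x
        del_pix = math.atan(delta_x * inv_SDD)            # 1/SDD pre-computed
        gam_pix = math.atan(delta_z / math.sqrt(delta_x**2 + SDD**2)) - angi * math.cos(del_pix)
        qx = k0 * (math.cos(gam_pix) * math.cos(del_pix) - cos_angi)   # no per-pixel cos(angi)
        qy = k0 * (math.cos(gam_pix) * math.sin(del_pix))
        qz = k0 * (math.sin(gam_pix) + sin_angi)                       # no per-pixel sin(angi)
        # ... rotation and the Binv32 product exactly as in the question ...

# single-precision output buffers
h_global_mem = cuda.to_device(np.zeros((pixel_count_y, pixel_count_x), dtype=np.float32))
k_global_mem = cuda.to_device(np.zeros((pixel_count_y, pixel_count_x), dtype=np.float32))
l_global_mem = cuda.to_device(np.zeros((pixel_count_y, pixel_count_x), dtype=np.float32))

Whether the approximate intrinsics are accurate enough depends on the q-resolution you need, so compare a few frames against the double-precision version before committing.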
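For the block-size experiment, a simple timing loop is enough. Something along these lines works, where the candidate shapes and the 10-repeat average are arbitrary choices:

import math
from timeit import default_timer as timer
from numba import cuda

for tpb in [(16, 16), (32, 8), (32, 16), (64, 4)]:
    bpg = (math.ceil(pixel_count_y / tpb[0]), math.ceil(pixel_count_x / tpb[1]))
    # warm-up launch (the very first one also triggers JIT compilation)
    detector_to_hkl_kernel[bpg, tpb](h_global_mem, k_global_mem, l_global_mem, omega_rad)
    cuda.synchronize()
    start = timer()
    for _ in range(10):
        detector_to_hkl_kernel[bpg, tpb](h_global_mem, k_global_mem, l_global_mem, omega_rad)
    cuda.synchronize()   # launches are asynchronous, so sync before reading the clock
    print(tpb, "->", (timer() - start) / 10, "s per call")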
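Finally, a sketch of the shared-memory idea for the last point: the sine/cosine of the rotation depend only on omega_rad, so one thread per block can evaluate them and share the result. In this particular case passing the two values as pre-computed kernel arguments would be even simpler, but the pattern generalises to other per-block-constant expressions:

import math
from numba import cuda, float32

@cuda.jit
def detector_to_hkl_kernel(h_glob, k_glob, l_glob, omega_rad):
    # so/co are identical for every pixel: evaluate them once per block
    # in shared memory instead of once per thread
    sincos = cuda.shared.array(shape=2, dtype=float32)
    if cuda.threadIdx.x == 0 and cuda.threadIdx.y == 0:
        sincos[0] = math.sin(rot_direction * omega_rad)
        sincos[1] = math.cos(rot_direction * omega_rad)
    cuda.syncthreads()   # every thread must wait until the values are written
    so = sincos[0]
    co = sincos[1]

    j, i = cuda.grid(2)
    if j < h_glob.shape[0] and i < h_glob.shape[1]:
        # ... the per-pixel computation from the question, using so/co ...
        pass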