I have some detector data collected on a 2D camera. I then convert it to a lab frame, so I end up with an (x^2+y^2) and z coordinate for each pixel in the image. The object then rotates about its normal, and there is an img for each rotation. I apply a rotation matrix to (x^2+y^2) to get x and y matrices for each img, so I end up with something like this for each image/angle, i.e. every pixel has a 3D position and an intensity:
    z           x           y           img
    444444444   123456789   123456789   123456789
    333333333   123456789   123456789   423466789
    222222222   123456789   123456789   223256789
    111111111   123456789   123456789   523456689
What I want to do then is extract a plane, i.e. plot a map of x and y for a given z range.

The problem is made slightly more complicated by the following: the lab frame is actually curved, so I can't rely on each row of x and y being the same. The images are around 2048x2048 pixels at 32 bits (TIFF), and there can be 1000 of them.
My current solution uses CUDA/Numba. I have a function that calculates z, x, y, img for a given angle, and I do that for all the angles. Each time I then slice a number of rows and extend a list with the raveled x, y, img values, and finally use scipy.interpolate.griddata to give a 2D map. griddata is pretty slow too; anything on the GPU would probably be better.
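In outline, each angle is handled roughly like this (the names here are just illustrative, not my exact code):

```python
import numpy as np
from scipy.interpolate import griddata

xs, ys, vals = [], [], []
rows = slice(900, 1100)                   # detector rows covering the wanted z range
for omega, img in zip(angles, images):    # up to ~1000 angles / 2048x2048 images
    z, x, y = detector_to_xyz(omega)      # per-angle conversion (the CUDA code below)
    xs.extend(x[rows].ravel())
    ys.extend(y[rows].ravel())
    vals.extend(img[rows].ravel())

# resample the scattered (x, y, intensity) points onto a regular 2D grid
gx, gy = np.mgrid[min(xs):max(xs):2048j, min(ys):max(ys):2048j]
plane = griddata((np.asarray(xs), np.asarray(ys)), np.asarray(vals), (gx, gy))
```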
The whole process is quite slow, so I'm looking for better solutions, or maybe a library that already does this? The CUDA code looks something like this; it isn't so slow itself:
    import math
    import numpy as np
    from numba import cuda

    #constants defined elsewhere: q0, angi, rot_direction, SDD, k0, Binv,
    #pixel_x, pixel_y, pixel_count_x, pixel_count_y

    @cuda.jit
    def detector_to_hkl_kernel(h_glob, k_glob, l_glob, omega_rad):
        #get the current thread position
        j, i = cuda.grid(2)

        if j < h_glob.shape[0] and i < h_glob.shape[1]:
            delta_z = (q0[1]-j)*pixel_y  #real-space distance from centre pixel, y
            delta_x = (i-q0[0])*pixel_x  #real-space distance from centre pixel, x
            delR = math.sqrt(delta_x**2 + delta_z**2)

            dist = math.sqrt(delta_x**2 + SDD**2 + delta_z**2)  #distance to pixel

            #lab coordinates of pixel in azimuthal angles
            del_pix = math.atan(delta_x/SDD)
            gam_pix = math.atan(delta_z/math.sqrt(delta_x**2 + SDD**2)) - angi*math.cos(del_pix)

            #lab coordinates in momentum transfer
            qx = k0*(math.cos(gam_pix)*math.cos(del_pix) - math.cos(angi))
            qy = k0*(math.cos(gam_pix)*math.sin(del_pix))
            qz = k0*(math.sin(gam_pix) + math.sin(angi))

            so = math.sin(rot_direction*omega_rad)
            co = math.cos(rot_direction*omega_rad)
            # we deal with the angle of incidence in the momentum transfer calc
            # so that part of the rotation matrix can be fixed
            ci = 1  #math.cos(angi)
            si = 0  #math.sin(angi)

            #rotation matrix
            hphi_1 = so*(ci*qy + si*qz) + co*qx
            hphi_2 = co*(ci*qy + si*qz) - so*qx
            hphi_3 = ci*qz - si*qy

            #H = Binv dot Hphi, computed as a manual dot product
            h_glob[j, i] = Binv[0][0]*hphi_1 + Binv[0][1]*hphi_2 + Binv[0][2]*hphi_3
            k_glob[j, i] = Binv[1][0]*hphi_1 + Binv[1][1]*hphi_2 + Binv[1][2]*hphi_3
            l_glob[j, i] = Binv[2][0]*hphi_1 + Binv[2][1]*hphi_2 + Binv[2][2]*hphi_3
    def detector_to_hkl(omega_rad):
        #host-side wrapper: allocate the outputs on the device and launch the kernel
        h_global_mem = cuda.to_device(np.zeros((pixel_count_y, pixel_count_x)))
        k_global_mem = cuda.to_device(np.zeros((pixel_count_y, pixel_count_x)))
        l_global_mem = cuda.to_device(np.zeros((pixel_count_y, pixel_count_x)))

        # Configure the blocks
        threadsperblock = (16, 16)
        blockspergrid_x = int(math.ceil(pixel_count_y / threadsperblock[0]))
        blockspergrid_y = int(math.ceil(pixel_count_x / threadsperblock[1]))
        blockspergrid = (blockspergrid_x, blockspergrid_y)

        detector_to_hkl_kernel[blockspergrid, threadsperblock](h_global_mem, k_global_mem, l_global_mem, omega_rad)

        return [h_global_mem.copy_to_host(), k_global_mem.copy_to_host(), l_global_mem.copy_to_host()]
First of all, note that you use double precision here and that mainstream mid-range consumer GPUs are very slow at double-precision floating-point computation. A GTX 1660 Super, for example, has a compute throughput of 5027 GFlops in single precision but only 157 GFlops in double precision (32 times slower). One simple solution is to use single-precision floating-point numbers in your code, by specifying dtype=np.float32 or by converting arrays with array.astype(np.float32). A more expensive alternative, if you cannot use single or mixed precision, is a professional GPU designed for double-precision workloads.
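For instance, with the arrays and constants from the code above, the switch to single precision might look like this (the reassignments must happen before the kernel is first called, since Numba captures globals at compile time):

```python
import numpy as np

# Make the captured constants and the output arrays single precision
# so the kernel arithmetic stays in float32.
pixel_x, pixel_y = np.float32(pixel_x), np.float32(pixel_y)
SDD, k0, angi = np.float32(SDD), np.float32(k0), np.float32(angi)
Binv = Binv.astype(np.float32)
q0 = np.asarray(q0, dtype=np.float32)

h_global_mem = cuda.to_device(np.zeros((pixel_count_y, pixel_count_x), dtype=np.float32))
k_global_mem = cuda.to_device(np.zeros((pixel_count_y, pixel_count_x), dtype=np.float32))
l_global_mem = cuda.to_device(np.zeros((pixel_count_y, pixel_count_x), dtype=np.float32))
```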
Moreover, several expressions can be pre-computed ahead of time and stored in constants. This includes for example math.cos(angi), math.sin(angi) and 1.0/SDD. Some other expressions can be stored in temporary variables, since the compiler may not be able to factor the code efficiently (mainly because of the trigonometric functions).
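As a sketch, using the names from the question's kernel:

```python
import math

# computed once on the host; Numba treats these globals as compile-time constants
cos_angi = math.cos(angi)
sin_angi = math.sin(angi)
inv_SDD  = 1.0 / SDD

# inside the kernel, reuse them (and keep repeated sub-expressions in temporaries):
#   del_pix = math.atan(delta_x * inv_SDD)
#   cg      = math.cos(gam_pix)                 # reused by qx and qy
#   qx      = k0 * (cg * math.cos(del_pix) - cos_angi)
#   qy      = k0 * (cg * math.sin(del_pix))
#   qz      = k0 * (math.sin(gam_pix) + sin_angi)
```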
Additionally, trigonometric functions are often very expensive, especially when the computation has to be IEEE-754 compliant (which is likely the case for the math.xxx calls). You can use approximations instead: CUDA provides the __cosf, __sinf and __tanf intrinsics for that, which should be much faster (but be careful with the accuracy of the results). I am not sure you can call them directly from Numba, but you can add the parameter fastmath=True to the JIT decorator, which can do that for you.
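In Numba this is just an extra argument on the decorator:

```python
from numba import cuda

# fastmath=True allows the compiler to use the faster, less accurate intrinsics;
# check that the resulting h/k/l values are still accurate enough for your data.
@cuda.jit(fastmath=True)
def detector_to_hkl_kernel(h_glob, k_glob, l_glob, omega_rad):
    ...  # kernel body unchanged
```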
I think using a 2D thread block of 32x8 may be a bit faster, because threads are packed into warps of 32 threads and the GPU executes them one warp at a time. But the best solution is to check the performance with many different block sizes.
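For example (the grid size is derived from the block shape exactly as before):

```python
# benchmark a few warp-aligned shapes, e.g. (32, 8), (32, 16), (64, 4)
threadsperblock = (32, 8)
blockspergrid_x = int(math.ceil(pixel_count_y / threadsperblock[0]))
blockspergrid_y = int(math.ceil(pixel_count_x / threadsperblock[1]))
blockspergrid = (blockspergrid_x, blockspergrid_y)
```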
If all of this is not enough, you could try to use shared memory to reduce the number of instructions executed per block, as some expressions are recomputed by many threads of the same block (see the sketch below).
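Here is a minimal sketch of the idea, not a drop-in replacement for the kernel above: delta_z depends only on the row index and delta_x only on the column index, so with a (32, 8) block one thread per row and one per column can fill them into shared memory once for the whole block (pixel_x, pixel_y and q0 below are illustrative values). The saving is only a few instructions per pixel, so benchmark whether it actually helps.

```python
import math
import numpy as np
from numba import cuda, float32

# Illustrative constants; use the real ones from the question.
pixel_x, pixel_y = np.float32(75e-6), np.float32(75e-6)
q0 = np.array([1024.0, 1024.0], dtype=np.float32)

@cuda.jit(fastmath=True)
def delta_kernel(out):
    # Per block: 32 rows x 8 columns of pixels (matches a (32, 8) thread block).
    s_dz = cuda.shared.array(shape=32, dtype=float32)  # delta_z depends only on the row j
    s_dx = cuda.shared.array(shape=8, dtype=float32)   # delta_x depends only on the column i
    j, i = cuda.grid(2)
    tx, ty = cuda.threadIdx.x, cuda.threadIdx.y

    if ty == 0:                                        # one thread per row fills s_dz
        s_dz[tx] = (q0[1] - j) * pixel_y
    if tx == 0:                                        # one thread per column fills s_dx
        s_dx[ty] = (i - q0[0]) * pixel_x
    cuda.syncthreads()

    if j < out.shape[0] and i < out.shape[1]:
        delta_z = s_dz[tx]
        delta_x = s_dx[ty]
        # ... the rest of the per-pixel computation would go here ...
        out[j, i] = math.sqrt(delta_x * delta_x + delta_z * delta_z)

out = cuda.to_device(np.zeros((2048, 2048), dtype=np.float32))
delta_kernel[(64, 256), (32, 8)](out)   # grid of (2048/32, 2048/8) blocks
```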