I am working with CUDA and 3D textures in python (using pycuda). There is a function called Memcpy3D which has the same members as Memcpy2D plus a few extras. In it it calls you to describe things such as width_in_bytes
, src_pitch
, src_height
, height
and copy_depth
. This is what I am struggling with (in 3D) and its relevance with C or F style indexing. For instance, if I simply change the ordering from F to C in the working example below, it stops working - and I don't know why.
threadIdx.x
(or the x direction, or a column). So for a float32 array of C shape (3,2,4), to move one value in x I expect to move 4 values in memory (as the indexing goes down the z axis first?). Therefore my pitch would be 4*32bits.height
to be the number of rows. (In this example, 3)width
to be the number of cols. (In this example, 2)depth
to be the number of z slices. (In this example, 4)width_in_bytes
to be the width of a row in x inclusive of the z elements behind it, i.e. a row slice, (0,:,:). This would be how many addresses in memory it takes to transverse one element in the y-direction.So when I change the ordering from F to C in the code below, and adapt the code to change the height/width values accordingly it still doesn't work. It just presents a logic failure which makes me think I'm not understanding the concept of pitch, width, height, depth correctly.
Please educate me.
Below is a full working script that copies an array to the GPU as a texture and copies the contents back.
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
w = 2
h = 3
d = 4
shape = (w, h, d)
a = np.arange(24).reshape(*shape,order='F').astype('float32')
print(a.shape,a.strides)
print(a)
descr = drv.ArrayDescriptor3D()
descr.width = w
descr.height = h
descr.depth = d
descr.format = drv.dtype_to_array_format(a.dtype)
descr.num_channels = 1
descr.flags = 0
ary = drv.Array(descr)
copy = drv.Memcpy3D()
copy.set_src_host(a)
copy.set_dst_array(ary)
copy.width_in_bytes = copy.src_pitch = a.strides[1]
copy.src_height = copy.height = h
copy.depth = d
copy()
mod = SourceModule("""
texture<float, 3, cudaReadModeElementType> mtx_tex;
__global__ void copy_texture(float *dest)
{
int x = threadIdx.x;
int y = threadIdx.y;
int z = threadIdx.z;
int dx = blockDim.x;
int dy = blockDim.y;
int i = (z*dy + y)*dx + x;
dest[i] = tex3D(mtx_tex, x, y, z);
}
""")
copy_texture = mod.get_function("copy_texture")
mtx_tex = mod.get_texref("mtx_tex")
mtx_tex.set_array(ary)
dest = np.zeros(shape, dtype=np.float32, order="F")
copy_texture(drv.Out(dest), block=shape, texrefs=[mtx_tex])
print(dest)
Not sure I fully understand the problem in your code, but I'll attempt to clarify.
In CUDA, width
(x
) refers to the fastest-changing dimension, height
(y
) is the middle dimension, and depth
(z
) is the slowest-changing dimension. The pitch
refers to the stride in bytes required to step between values along the y
dimension.
In Numpy, an array defined as np.empty(shape=(3,2,4), dtype=np.float32, order="C")
has strides=(32, 16, 4)
, and corresponds to width=4
, height=2
, depth=3
, pitch=16
.
Using "F"
ordering in Numpy means the order of dimensions is reversed in memory.
Your code appears to work if I make the following changes:
#shape = (w, h, d)
shape = (d, h, w)
#a = np.arange(24).reshape(*shape,order='F').astype('float32')
a = np.arange(24).reshape(*shape,order='C').astype('float32')
...
#dest = np.zeros(shape, dtype=np.float32, order="F")
dest = np.zeros(shape, dtype=np.float32, order="C")
#copy_texture(drv.Out(dest), block=shape, texrefs=[mtx_tex])
copy_texture(drv.Out(dest), block=(w,h,d), texrefs=[mtx_tex])