I have to write in a PyCUDA function that gets two matrices Nx3 and Mx3, and return a matrix NxM, but I can't figure out how to pass by reference a matrix without knowing the number of columns.
My code basically is something like that:
#kernel declaration
mod = SourceModule("""
__global__ void distance(int N, int M, float d1[][3], float d2[][3], float res[][M])
{
int i = threadIdx.x;
int j = threadIdx.y;
float x, y, z;
x = d2[j][0]-d1[i][0];
y = d2[j][1]-d1[i][1];
z = d2[j][2]-d1[i][2];
res[i][j] = x*x + y*y + z*z;
}
""")
#load data
data1 = numpy.loadtxt("data1.txt").astype(numpy.float32) # Nx3 matrix
data2 = numpy.loadtxt("data2.txt").astype(numpy.float32) # Mx3 matrix
N=data1.shape[0]
M=data2.shape[0]
res = numpy.zeros([N,M]).astype(numpy.float32) # NxM matrix
#invoke kernel
dist_gpu = mod.get_function("distance")
dist_gpu(cuda.In(numpy.int32(N)), cuda.In(numpy.int32(M)), cuda.In(data1), cuda.In(data2), cuda.Out(res), block=(N,M,1))
#save data
numpy.savetxt("results.txt", res)
Compiling this I receive an error:
kernel.cu(3): error: a parameter is not allowed
that is, I cannot use M as the number of columns for res[][] in the declaretion of the function. I cannot either left the number of columns undeclared...
I need a matrix NxM as an output, but I can't figure out how to do this. Can you help me?
You should use pitched linear memory access inside the kernel, that is how ndarray
and gpuarray
store data internally, and PyCUDA will pass a pointer to the data in gpu memory allocated for a gpuarray
when it is supplied as a argument to a PyCUDA kernel. So (if I understand what you are trying to do) your kernel should be written as something like:
__device__ unsigned int idx2d(int i, int j, int lda)
{
return j + i*lda;
}
__global__ void distance(int N, int M, float *d1, float *d2, float *res)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int j = threadIdx.y + blockDim.y * blockIdx.y;
float x, y, z;
x = d2[idx2d(j,0,3)]-d1[idx2d(i,0,3)];
y = d2[idx2d(j,1,3)]-d1[idx2d(i,1,3)];
z = d2[idx2d(j,2,3)]-d1[idx2d(i,2,3)];
res[idx2d(i,j,N)] = x*x + y*y + z*z;
}
Here I have assumed the numpy
default row major ordering in defining the idx2d
helper function. There are still problems with the Python side of the code you posted, but I guess you know that already.
EDIT: Here is a complete working repro case based of the code posted in your question. Note that it only uses a single block (like the original), so be mindful of block and grid dimensions when trying to run it on anything other than trivially small cases.
import numpy as np
from pycuda import compiler, driver
from pycuda import autoinit
#kernel declaration
mod = compiler.SourceModule("""
__device__ unsigned int idx2d(int i, int j, int lda)
{
return j + i*lda;
}
__global__ void distance(int N, int M, float *d1, float *d2, float *res)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int j = threadIdx.y + blockDim.y * blockIdx.y;
float x, y, z;
x = d2[idx2d(j,0,3)]-d1[idx2d(i,0,3)];
y = d2[idx2d(j,1,3)]-d1[idx2d(i,1,3)];
z = d2[idx2d(j,2,3)]-d1[idx2d(i,2,3)];
res[idx2d(i,j,N)] = x*x + y*y + z*z;
}
""")
#make data
data1 = np.random.uniform(size=18).astype(np.float32).reshape(-1,3)
data2 = np.random.uniform(size=12).astype(np.float32).reshape(-1,3)
N=data1.shape[0]
M=data2.shape[0]
res = np.zeros([N,M]).astype(np.float32) # NxM matrix
#invoke kernel
dist_gpu = mod.get_function("distance")
dist_gpu(np.int32(N), np.int32(M), driver.In(data1), driver.In(data2), \
driver.Out(res), block=(N,M,1), grid=(1,1))
print res