
PyCUDA - passing a matrix by reference from python to C++ CUDA code


I have to write a PyCUDA function that takes two matrices, Nx3 and Mx3, and returns an NxM matrix, but I can't figure out how to pass a matrix by reference without knowing its number of columns.

My code is basically something like this:

#kernel declaration
mod = SourceModule("""
__global__ void distance(int N, int M, float d1[][3], float d2[][3], float res[][M])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    float x, y, z;
    x = d2[j][0]-d1[i][0];
    y = d2[j][1]-d1[i][1];
    z = d2[j][2]-d1[i][2];
    res[i][j] = x*x + y*y + z*z;
}
""")

#load data
data1 = numpy.loadtxt("data1.txt").astype(numpy.float32) # Nx3 matrix
data2 = numpy.loadtxt("data2.txt").astype(numpy.float32) # Mx3 matrix
N=data1.shape[0]
M=data2.shape[0]
res = numpy.zeros([N,M]).astype(numpy.float32) # NxM matrix

#invoke kernel
dist_gpu = mod.get_function("distance")
dist_gpu(cuda.In(numpy.int32(N)), cuda.In(numpy.int32(M)), cuda.In(data1), cuda.In(data2), cuda.Out(res), block=(N,M,1))

#save data
numpy.savetxt("results.txt", res)

Compiling this I receive an error:

kernel.cu(3): error: a parameter is not allowed

That is, I cannot use M as the number of columns for res[][] in the function declaration, nor can I leave the number of columns undeclared...

I need an NxM matrix as output, but I can't figure out how to do this. Can you help me?


Solution

  • You should use pitched linear memory access inside the kernel. That is how ndarray and gpuarray store data internally, and PyCUDA will pass a pointer to the data in GPU memory allocated for a gpuarray when it is supplied as an argument to a PyCUDA kernel. So (if I understand what you are trying to do) your kernel should be written something like this:

    __device__ unsigned int idx2d(int i, int j, int lda)
    {
        return j + i*lda;
    }
    
    __global__ void distance(int N, int M, float *d1, float *d2, float *res)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int j = threadIdx.y + blockDim.y * blockIdx.y;
        float x, y, z;
        x = d2[idx2d(j,0,3)]-d1[idx2d(i,0,3)];
        y = d2[idx2d(j,1,3)]-d1[idx2d(i,1,3)];
        z = d2[idx2d(j,2,3)]-d1[idx2d(i,2,3)];
    
        res[idx2d(i,j,M)] = x*x + y*y + z*z;
    }
    

    Here I have assumed numpy's default row-major ordering in defining the idx2d helper function; lda is the length of a row, i.e. the number of columns, so for the NxM result matrix the correct leading dimension is M. There are still problems with the Python side of the code you posted, but I guess you know that already.
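As a quick sanity check (my addition, not part of the original answer), the same indexing convention can be verified on the host against numpy's row-major flattening:

```python
import numpy as np

def idx2d(i, j, lda):
    # mirrors the CUDA helper: row-major layout, lda = row length (columns)
    return j + i * lda

a = np.arange(12, dtype=np.float32).reshape(3, 4)  # a 3x4 row-major matrix
flat = a.ravel()  # numpy's default C (row-major) order

# element (i, j) lives at flat offset idx2d(i, j, number_of_columns)
assert all(flat[idx2d(i, j, 4)] == a[i, j]
           for i in range(3) for j in range(4))
```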


    EDIT: Here is a complete working repro case based on the code posted in your question. Note that it only uses a single block (like the original), so be mindful of block and grid dimensions when trying to run it on anything other than trivially small cases.
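(A side note, not from the original answer: one plausible way to pick those dimensions for larger inputs is a fixed tile size, here an arbitrary 16x16, combined with an `if (i < N && j < M)` guard in the kernel so threads outside the output matrix do nothing.)

```python
def launch_config(N, M, tile=16):
    # one thread per output element; tile x tile = 256 threads per block,
    # well under the usual 1024 threads-per-block hardware limit
    block = (tile, tile, 1)
    # round up so the grid of blocks covers all N x M elements
    grid = ((N + tile - 1) // tile, (M + tile - 1) // tile)
    return block, grid

# e.g. a 100 x 37 result needs a 7 x 3 grid of 16 x 16 blocks
block, grid = launch_config(100, 37)
```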

    import numpy as np
    from pycuda import compiler, driver
    from pycuda import autoinit
    
    #kernel declaration
    mod = compiler.SourceModule("""
    __device__ unsigned int idx2d(int i, int j, int lda)
    {
        return j + i*lda;
    }
    
    __global__ void distance(int N, int M, float *d1, float *d2, float *res)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int j = threadIdx.y + blockDim.y * blockIdx.y;
        float x, y, z;
        x = d2[idx2d(j,0,3)]-d1[idx2d(i,0,3)];
        y = d2[idx2d(j,1,3)]-d1[idx2d(i,1,3)];
        z = d2[idx2d(j,2,3)]-d1[idx2d(i,2,3)];
    
        res[idx2d(i,j,M)] = x*x + y*y + z*z;
    }
    """)
    
    #make data
    data1 = np.random.uniform(size=18).astype(np.float32).reshape(-1,3)
    data2 = np.random.uniform(size=12).astype(np.float32).reshape(-1,3)
    N=data1.shape[0]
    M=data2.shape[0]
    res = np.zeros([N,M]).astype(np.float32) # NxM matrix
    
    #invoke kernel
    dist_gpu = mod.get_function("distance")
    dist_gpu(np.int32(N), np.int32(M), driver.In(data1), driver.In(data2),
            driver.Out(res), block=(N,M,1), grid=(1,1))
    
    print(res)
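To validate the kernel's output on the host (a verification sketch I am adding, not part of the original answer), the same pairwise squared distances can be computed with numpy broadcasting and compared against `res` using `np.allclose`:

```python
import numpy as np

def pairwise_sq_dist(d1, d2):
    # d1: Nx3, d2: Mx3 -> NxM matrix of squared Euclidean distances,
    # the same quantity each (i, j) thread of the kernel computes
    diff = d1[:, None, :] - d2[None, :, :]  # NxMx3 via broadcasting
    return (diff * diff).sum(axis=-1)

data1 = np.random.uniform(size=18).astype(np.float32).reshape(-1, 3)
data2 = np.random.uniform(size=12).astype(np.float32).reshape(-1, 3)
ref = pairwise_sq_dist(data1, data2)  # shape (6, 4)
```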