
pycuda, cuda -- some questions and a simple code that gives me the error "identifier "N" is undefined"


I am trying to learn PyCUDA and I have a few questions that I am trying to understand. I think my main question is how to communicate between PyCUDA and a function inside a CUDA file.

So, if I have a C++ file (a CUDA file) with some functions in it, and I want to use PyCUDA in one of them. For example, let's say I want the function 'compute', which contains some arrays and does calculations on them. What would be my approach?

1) Initialize the arrays in Python, allocate memory on the GPU and transfer the data to the GPU.

2) Call mod = SourceModule("""__global__ void ......""") from PyCUDA.

Now, I want to ask: how will I handle this module? Will I put the whole 'compute' function in it? Because if I only do some calculations in '__global__', I don't know how to communicate between PyCUDA and the C++ functions afterwards. How will I pass my results back to the C++ file (the CUDA file)?

3) In CUDA we have the number of threads per block as 'blockDim' and the number of blocks as 'gridDim'. In PyCUDA? We have a block size, block(4,4,1), which means 16 threads? And a grid size, grid(16,16), which means 256 blocks?

4) I tried to do in PyCUDA an example from the 'CUDA by Example' book which adds vectors. The code is below:

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import scipy as sc



N=50*1024

a=sc.arange(0,N).astype(sc.float32)
a_gpu = cuda.mem_alloc(a.nbytes) #allocate memory on GPU
cuda.memcpy_htod(a_gpu, a) #transfer data to the GPU

b=sc.array([i**2 for i in range(0,N)]).astype(sc.float32)
b_gpu = cuda.mem_alloc(b.nbytes)#allocate memory on GPU
cuda.memcpy_htod(b_gpu, b) #transfer data to the GPU

c=sc.zeros(N).astype(sc.float32)
c_gpu = cuda.mem_alloc(c.nbytes)#allocate memory on GPU


mod = SourceModule("""
    __global__ void add(float *a, float *b, float *c){
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        while (tid < N){
            c[tid] = a[tid] + b[tid];
            tid += blockDim.x * gridDim.x;
        }
    }
    """)

#call the function(kernel)
func = mod.get_function("add")
func(a_gpu,b_gpu,c_gpu, block=(16,16,1),grid=(16,16))

#transfer data back to CPU
cuda.memcpy_dtoh(c, c_gpu)

but it gives me the error: "identifier "N" is undefined".

Thanks!


Solution

  • The way I use PyCUDA, and the way I think it is intended to be used, is as a bridge between Python and CUDA. It's not a Python->C++ interface tool; for that you would have to look at something like SWIG. I wouldn't use PyCUDA inside C++ code to interface with a GPU; instead I would prototype or design my application using PyCUDA and later move it to C++ only.

    With that in mind, I'll try to tackle your questions.

    1) With PyCUDA you could also use the gpuarray module, which will handle the allocation and transfer steps for you, so you can just instantiate the arrays and use them on the GPU:

    import numpy
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray

    a = gpuarray.arange(400, dtype=numpy.float32)
    b = gpuarray.arange(400, dtype=numpy.float32)
    # call a CUDA function here, passing 'a' and 'b'
    resulta = a.get()
    resultb = b.get()
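
    As a side note (my addition, not part of the original answer), gpuarray objects also support NumPy-style arithmetic directly on the device, so a simple vector add doesn't even require writing a kernel. Reusing a and b from the snippet above:

    c = (a + b).get()   # elementwise add runs on the GPU; .get() copies the result back
    print(c[:4])        # [0. 2. 4. 6.]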
    

    2) Again, PyCUDA is not a C++ interface. If you need the results to go from CUDA -> Python -> C++, I don't think you need Python in the middle.

    3) Yes, block=(4,4,1) is 16 threads and grid=(16,16) is 256 blocks.

    Edit:

    To answer some of your comments:

    Yes, a block=(4,1,1) is one-dimensional and a block=(4,4,1) is 2D.
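
    To make the block/grid bookkeeping concrete, here is a minimal sketch (the kernel name fill_tid and the output buffer are my own, not from the question): each thread of a (4,4,1) block in a (16,16) grid writes its flattened global index.

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
        __global__ void fill_tid(int *out){
            // flattened global index across a 2D block and a 2D grid
            int tx = threadIdx.x + blockIdx.x * blockDim.x;
            int ty = threadIdx.y + blockIdx.y * blockDim.y;
            int width = blockDim.x * gridDim.x;   // total threads along x
            out[ty * width + tx] = ty * width + tx;
        }
        """)

    fill_tid = mod.get_function("fill_tid")
    out = np.zeros(4*4 * 16*16, dtype=np.int32)   # block=(4,4,1), grid=(16,16) -> 4096 threads
    out_gpu = cuda.mem_alloc(out.nbytes)
    fill_tid(out_gpu, block=(4, 4, 1), grid=(16, 16))
    cuda.memcpy_dtoh(out, out_gpu)
    print(out[:8])   # [0 1 2 3 4 5 6 7]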

    I fixed your code; you just had to pass N to the CUDA kernel.

    import pycuda.driver as cuda
    import pycuda.autoinit
    from pycuda.compiler import SourceModule
    import scipy as sc


    N = 50*1024

    a = sc.arange(0, N).astype(sc.float32)
    a_gpu = cuda.mem_alloc(a.nbytes)  # allocate memory on the GPU
    cuda.memcpy_htod(a_gpu, a)        # transfer data to the GPU

    b = sc.array([i**2 for i in range(0, N)]).astype(sc.float32)
    b_gpu = cuda.mem_alloc(b.nbytes)  # allocate memory on the GPU
    cuda.memcpy_htod(b_gpu, b)        # transfer data to the GPU

    c = sc.zeros(N).astype(sc.float32)
    c_gpu = cuda.mem_alloc(c.nbytes)  # allocate memory on the GPU


    mod = SourceModule("""
        __global__ void add(float *a, float *b, float *c, int N){
            int tid = threadIdx.x + blockIdx.x * blockDim.x;
            while (tid < N){                  // grid-stride loop covers all N elements
                c[tid] = a[tid] + b[tid];
                tid += blockDim.x * gridDim.x;
            }
        }
        """)

    # call the function (kernel); N is passed as a 32-bit int argument
    func = mod.get_function("add")
    func(a_gpu, b_gpu, c_gpu, sc.int32(N), block=(16,16,1), grid=(16,16))

    # transfer the result back to the CPU
    cuda.memcpy_dtoh(c, c_gpu)
    print(c)
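
    As a quick sanity check (my addition, not part of the original answer), the GPU result can be compared against the same sum computed on the host:

    assert sc.allclose(c, a + b)   # GPU sums should match the CPU result element-wise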
    

    Another way of doing this is to use string substitution on the SourceModule:

    mod = SourceModule("""
        __global__ void add(float *a, float *b, float *c){
            const int N = %d;
            int tid = threadIdx.x + blockIdx.x * blockDim.x;
            while (tid < N){
                c[tid] = a[tid] + b[tid];
                tid += blockDim.x * gridDim.x;
            }
        }
        """ % (N))
    

    One last note: when you are using PyCUDA, it generally works as the glue that connects all the different pieces of working with CUDA. It helps you compile code, allocate memory, run your kernels, and so on. As long as you are using it like this, you will be fine.