I can't understand why the program below runs fine with block=(N,1,1) but not with (1,1,N) (the result is "invalid value"), (1,N,1) (the result is 0,1,0,...,0) or (10,50,1) (the result is 0,1,0,...,0), where N=500.
import pycuda.autoinit  # creates the CUDA context
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule
import numpy as np
N = 500
a_gpu = gpuarray.to_gpu(np.zeros(N).astype(np.int32))
mod = SourceModule("""
#include <cmath>
extern "C" {
__global__ void myfunc(int *a, int N)
{
    int idx = threadIdx.x; //+blockIdx.x*blockDim.x;
    if (idx < N)
        a[idx] = idx;
}
}
""", no_extern_c=1)
# call the function (kernel)
func = mod.get_function("myfunc")
func(a_gpu, np.int32(N), block=(N, 1, 1), grid=(1, 1))
a = a_gpu.get()
print("a = ", a)
--------------EDIT----------------------------------------
OK, I forgot that if I use int idx=threadIdx.y, then I can use block(1,N,1).
But then, must I always use the block(N,1,1) arrangement?
I need to understand this. Thank you!
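The first block dimension corresponds to threadIdx.x, the second to threadIdx.y and the third to threadIdx.z.
When you launch block(N, 1, 1), threadIdx.x goes from 0 to N-1, while threadIdx.y and threadIdx.z are always zero.
When you launch block(1, N, 1), threadIdx.x is always zero and threadIdx.y goes from 0 to N-1.
When you launch block(1, 1, N), only threadIdx.z varies. Note that CUDA limits the z dimension of a block to 64 threads, which is most likely why block(1, 1, 500) fails with an invalid value error.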
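To make this concrete, here is a small pure-Python sketch (no GPU needed) that enumerates the thread indices of a single block and mimics what the kernel from the question would write. The helper names thread_indices and simulate_kernel are made up for illustration, and the loop only models the indexing, not real parallel execution or hardware limits.

```python
import itertools

def thread_indices(block):
    """Yield (x, y, z) for every thread in one block of shape (bx, by, bz)."""
    bx, by, bz = block
    for z, y, x in itertools.product(range(bz), range(by), range(bx)):
        yield x, y, z

def simulate_kernel(block, n):
    """Host-side imitation of: idx = threadIdx.x; if (idx < N) a[idx] = idx;"""
    a = [0] * n
    for x, y, z in thread_indices(block):
        idx = x  # only threadIdx.x is used, as in the question's kernel
        if idx < n:
            a[idx] = idx
    return a

n = 8  # small N so the printed lists stay readable
print(simulate_kernel((n, 1, 1), n))  # every element gets written
print(simulate_kernel((1, n, 1), n))  # threadIdx.x is always 0, so only a[0] is touched
print(simulate_kernel((2, 4, 1), n))  # threadIdx.x is 0 or 1, so only a[0] and a[1] are touched
```

With any block shape other than (N, 1, 1), several threads share the same threadIdx.x, so they all write to the same few elements and the rest of the array keeps its initial zeros.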
You do not have to use block(N, 1, 1); compute a flat index from all the thread coordinates instead. So instead of
idx = threadIdx.x;
use
idx = blockDim.x * threadIdx.y + threadIdx.x;
or, more generally (needed only when using block(X, Y, Z) with Z > 1),
idx = (blockDim.y * threadIdx.z + threadIdx.y) * blockDim.x + threadIdx.x;
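As a quick sanity check, the flattened index can be evaluated on the host for a few block shapes. This is only a sketch: flat_index and covered are hypothetical helper names, the code runs in plain Python rather than on the GPU, and it does not check the hardware limits on block dimensions.

```python
import itertools

def flat_index(thread, block):
    """Host-side copy of
    (blockDim.y * threadIdx.z + threadIdx.y) * blockDim.x + threadIdx.x."""
    x, y, z = thread
    bx, by, _bz = block
    return (by * z + y) * bx + x

def covered(block):
    """Sorted list of flat indices produced by all threads of one block."""
    bx, by, bz = block
    return sorted(
        flat_index((x, y, z), block)
        for z, y, x in itertools.product(range(bz), range(by), range(bx))
    )

n = 500
for block in [(n, 1, 1), (1, n, 1), (10, 50, 1), (5, 10, 10)]:
    # Every shape with bx * by * bz == n should hit each index exactly once.
    assert covered(block) == list(range(n))
    print(block, "covers indices 0 ..", n - 1, "exactly once")
```

Because each thread gets a unique index in 0..N-1 regardless of how the N threads are arranged, a kernel that uses this expression gives the same result for any of these block shapes.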