
PyCUDA: only block(N,1,1) works

I can't understand why the program below runs correctly for block=(N,1,1) but not for block=(1,1,N) (the result is "invalid value"), block=(1,N,1) (the result is 0,1,0,...,0) or block=(10,50,1) (the result is 0,1,0,...,0). Here N=500.

import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
import random
from pycuda.curandom import rand
import cmath
import pycuda.driver as drv


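mod = SourceModule("""
#include <cmath>

extern "C" {

__global__ void myfunc(int *a, int N)
{
    int idx = threadIdx.x;   // + blockIdx.x * blockDim.x;

    if (idx < N)
    {
        // ... (the rest of the kernel body is missing from the post)
    }
}

}
""")

# ... (the host-side setup of N, a and a_gpu is missing from the post)

# call the function (kernel)
func = mod.get_function("myfunc")

func(a_gpu, np.int32(N), block=(N, 1, 1), grid=(1, 1))

print("a = ", a)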


OK, I forgot that if I use int idx = threadIdx.y, then I can use block(1,N,1).

But then, must I always use the arrangement block(N,1,1)?

I'd really like to understand this. Thank you!


  • The first block dimension corresponds to threadIdx.x, the second to threadIdx.y and the third to threadIdx.z.

    When you launch block(N,1,1), threadIdx.x goes from 0 to N-1, while threadIdx.y and threadIdx.z are always zero.

    When you launch block(1,N,1), threadIdx.x is always zero and threadIdx.y goes from 0 to N-1.

    So instead of having

    idx = threadIdx.x;

    change it to

    idx = blockDim.x * threadIdx.y + threadIdx.x;

    or, more generally (needed only when using block(X,Y,Z) with Z > 1),

    idx = (blockDim.y * threadIdx.z + threadIdx.y) * blockDim.x + threadIdx.x;
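To see why the linearized index matters, here is a minimal pure-Python sketch that emulates the threads of a single block on the CPU (it does not use PyCUDA or a GPU). The helper names thread_indices, covered, x_only and linear are made up for this illustration. It counts how many distinct array elements get touched when each thread uses idx = threadIdx.x, compared with the linearized formula above.

```python
from itertools import product

def thread_indices(block):
    """Yield (x, y, z) for every thread of one block, mimicking threadIdx."""
    bx, by, bz = block
    for z, y, x in product(range(bz), range(by), range(bx)):
        yield x, y, z

def covered(block, index_fn):
    """Return the sorted distinct array indices touched by the threads of `block`."""
    return sorted({index_fn(x, y, z, block) for x, y, z in thread_indices(block)})

def x_only(x, y, z, block):
    # Emulates `idx = threadIdx.x` from the question's kernel.
    return x

def linear(x, y, z, block):
    # Emulates idx = (blockDim.y * threadIdx.z + threadIdx.y) * blockDim.x + threadIdx.x
    bx, by, _ = block
    return (by * z + y) * bx + x

N = 500
for block in [(N, 1, 1), (1, N, 1), (10, 50, 1)]:
    # Print: block shape, elements reached with x_only, elements reached with linear
    print(block, len(covered(block, x_only)), len(covered(block, linear)))
```

With idx = threadIdx.x, only block(N,1,1) reaches all N elements; block(1,N,1) reaches a single element and block(10,50,1) reaches ten, while the linearized index reaches all N in every case. Separately, as far as I know CUDA limits a block to at most 64 threads along z (and 1024 threads in total on current hardware), which would explain the "invalid value" result for block(1,1,500).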