Tags: parallel-processing, cuda, thrust

Dividing jobs for threads in Cuda using Thrust


I have some test code that needs to update keys inside a device_vector of a class. How do I divide portions of the work among specific threads?

Example of the code without the division:

__global__ void UpdateKeys(Request* vector, int size, int seed, int qt_threads){
   curandState_t state;
   curand_init(seed, threadIdx.x, 0, &state);
   int id = blockIdx.x * blockDim.x + threadIdx.x;
   if(id < size){
       vector[id].key_ = (curand(&state) % 100) / 100;
   }
}

That vector is passed as a thrust::device_vector.

Examples of what I want:

1000 keys and 2000 threads: use only 1000 threads and give a key to each one.
1000 keys and 1000 threads: use them all.
1 key and 100 threads: use 1 thread.
500 keys and 250 threads: each thread takes care of 2.
240 keys and 80 threads: each thread takes care of 3.


Solution

  • If you modify your basic kernel structure like this:

    __global__ void UpdateKeys(Request* vector, int size, int seed, int qt_threads){
       curandState_t state;
       curand_init(seed, threadIdx.x, 0, &state);
       int id = blockIdx.x * blockDim.x + threadIdx.x;
       int gid = blockDim.x * gridDim.x;   // total threads in the grid (the stride)
       for(; id < size; id += gid){        // grid-stride loop over the keys
           vector[id].key_ = (curand(&state) % 100) / 100;
       }
    }
    

     then any legal one-dimensional block size (and number of one-dimensional blocks) can process as many or as few inputs as you choose to provide via the size parameter. If you run more threads than keys, some threads will do nothing; if you run fewer threads than keys, some threads will process multiple keys.
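
A host-side launch might look like the sketch below. The Request definition, seed, and launch configuration are assumptions (the original post does not show them); the raw device pointer for the kernel is obtained from the thrust::device_vector with thrust::raw_pointer_cast. Note that the division by 100 above is integer division, which always yields 0 for an integer key; the sketch divides by 100.0f on the assumption that a fractional key in [0, 1) was intended.

```cuda
#include <thrust/device_vector.h>
#include <curand_kernel.h>

struct Request {
    float key_;   // assumed layout; the original class is not shown
};

__global__ void UpdateKeys(Request* vector, int size, int seed, int qt_threads)
{
    curandState_t state;
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, id, 0, &state);   // one RNG sequence per thread
    int stride = blockDim.x * gridDim.x;
    for (; id < size; id += stride)     // grid-stride loop over the keys
        vector[id].key_ = (curand(&state) % 100) / 100.0f;
}

int main()
{
    const int size = 1000;
    thrust::device_vector<Request> requests(size);

    // Any legal 1-D configuration works; the grid-stride loop covers both
    // the threads > keys and the threads < keys cases.
    const int threads = 256;
    const int blocks  = (size + threads - 1) / threads;

    UpdateKeys<<<blocks, threads>>>(thrust::raw_pointer_cast(requests.data()),
                                    size, 1234, blocks * threads);
    cudaDeviceSynchronize();
    return 0;
}
```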