
How does CUDA handle thread offset as in OpenCL?


For example, in OpenCL I can launch kernels with an offset on the thread ids, so that when using multiple GPUs, the second GPU's threads can start directly from an arbitrary integer value:

GPU1: thread idx 0 to k
GPU2: thread idx k+1 to N

and the kernel can look exactly the same on all GPUs:

__kernel void vecAdd(..)
{
   int threadId = get_global_id(0); // this starts from k+1 for GPU2
   ..
}

What is the equivalent of this in CUDA when calling kernels through the driver API?

OpenCL has:

cl_int clEnqueueNDRangeKernel ( cl_command_queue command_queue,
    cl_kernel kernel,
    cl_uint work_dim,
    const size_t *global_work_offset, // this can be given k+1
    const size_t *global_work_size,
    const size_t *local_work_size,
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event)
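
For reference, this is how I pass the offset on the OpenCL side; a minimal sketch assuming queue, kernel, k and N are already set up:

/* Hypothetical values for illustration: global_work_offset shifts
   get_global_id(0) by k+1 on the second device, and the remaining
   N-(k+1) work-items are covered by global_work_size.             */
size_t offset = k + 1;
size_t global = N - (k + 1);
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1,
                                    &offset,  /* global_work_offset */
                                    &global,  /* global_work_size   */
                                    NULL,     /* local_work_size    */
                                    0, NULL, NULL);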

Do I have to embed the offset in the kernel myself, as in the example below, to use the most similar-looking kernel on multiple GPUs?

__global__ void vecAdd(.., int * gpuParams)
{
    int offset = gpuParams[0]; // per-launch offset read from a device buffer
    int threadId = offset + blockIdx.x * blockDim.x + threadIdx.x;
}

(this is for flattened buffer accesses and kernels)

This offset will change dynamically before each kernel launch, so I guess I can't use #define for it.

The CUDA driver API has this for launching a kernel:

CUresult cuLaunchKernel ( CUfunction f, unsigned int  gridDimX, 
                unsigned int  gridDimY, unsigned int  gridDimZ, 
                unsigned int  blockDimX, unsigned int  blockDimY, 
                unsigned int  blockDimZ, unsigned int  sharedMemBytes, 
                CUstream hStream, void** kernelParams, void** extra )
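
So presumably the offset just has to go through kernelParams as an ordinary kernel argument. A minimal sketch of what I mean, assuming f, hStream and the device pointers dIn/dOut are already set up (the names are illustrative):

/* The offset is just another kernel argument. kernelParams is an array of
   pointers to the argument values, in the order the kernel declares them. */
int offset = k + 1;                       /* changes before every launch */
void *args[] = { &dIn, &dOut, &offset };  /* e.g. vecAdd(float*, float*, int) */
CUresult res = cuLaunchKernel(f,
                              numBlocks, 1, 1,  /* grid dimensions  */
                              256, 1, 1,        /* block dimensions */
                              0, hStream,       /* shared mem, stream */
                              args,             /* kernelParams */
                              NULL);            /* extra */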

Solution

  • There is no equivalent functionality in CUDA for this, but you don't really need it either.

    In the standard CUDA model you cannot launch kernels on multiple devices with a single API call (and you technically can't allocate memory on multiple devices with a single API call either). Each GPU is effectively independent, so you simply launch on each device separately and pass each launch its own offset as an ordinary kernel argument (see the sketch below).

    NVIDIA have introduced a multi-device launch and synchronization approach as part of the new cooperative groups features in CUDA 9, but these are entirely different from what you are asking about.
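
    For example, a minimal runtime API sketch of that per-device pattern, passing the offset by value instead of through a gpuParams buffer. The kernel name, the managed allocation and the even split are illustrative assumptions, not a fixed recipe (managed memory is used here so the flattened buffer is visible on every device, which requires a system where all devices can access managed memory):

    // Identical kernel body on every GPU; only the offset argument differs.
    __global__ void vecAdd(float *buf, int offset, int count)
    {
        int threadId = offset + blockIdx.x * blockDim.x + threadIdx.x;
        if (threadId < offset + count)
            buf[threadId] += 1.0f;   // flattened-buffer access, as in the question
    }

    void runSplit(int N, int numDevices)
    {
        float *buf;
        cudaMallocManaged(&buf, N * sizeof(float));   // one buffer, all devices

        int perDevice = N / numDevices;               // assume N divides evenly
        for (int dev = 0; dev < numDevices; ++dev) {
            cudaSetDevice(dev);                       // select the GPU
            int offset = dev * perDevice;             // GPU1: 0..k, GPU2: k+1..N
            int blocks = (perDevice + 255) / 256;
            vecAdd<<<blocks, 256>>>(buf, offset, perDevice);
        }
        for (int dev = 0; dev < numDevices; ++dev) {
            cudaSetDevice(dev);
            cudaDeviceSynchronize();                  // wait for each GPU
        }
        cudaFree(buf);
    }

    The kernel is exactly the same on every GPU, just as in the OpenCL version; the only thing that changes per launch is the offset argument.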