For example, in OpenCL I can launch kernels with an offset on the thread ids, so that when using multiple GPUs the threads of the second GPU start directly from an arbitrary integer value:
GPU1: thread idx 0 to k
GPU2: thread idx k+1 to N
and the kernel can look exactly the same on all GPUs:
__kernel void vecAdd(..)
{
    int threadId = get_global_id(0); // this starts from k+1 on GPU2
    ..
}
What is the equivalent of this in CUDA when launching kernels through the driver API?
OpenCL has
cl_int clEnqueueNDRangeKernel ( cl_command_queue command_queue,
cl_kernel kernel,
cl_uint work_dim,
const size_t *global_work_offset, // this can be given k+1
const size_t *global_work_size,
const size_t *local_work_size,
cl_uint num_events_in_wait_list,
const cl_event *event_wait_list,
cl_event *event)
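For completeness, this is roughly how I launch on the second GPU with the offset (just a sketch; queue2, kernel, k, N and the work-group size are placeholders from my setup):

// host-side sketch, assuming queue2 and kernel are already created
size_t globalOffset = k + 1;             // GPU2 starts at k+1
size_t globalSize   = N - (k + 1);       // work-items handled by GPU2
size_t localSize    = 256;
cl_int err = clEnqueueNDRangeKernel(queue2, kernel, 1,
                                    &globalOffset,  // global_work_offset
                                    &globalSize,    // global_work_size
                                    &localSize,     // local_work_size
                                    0, NULL, NULL);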
Do I have to embed the offset in the kernel myself, as in the example below, to use the most similar-looking kernel on multiple GPUs?
__global__ void vecAdd(.., int *gpuParams)
{
    int offset = gpuParams[0];
    int threadId = offset + blockIdx.x * blockDim.x + threadIdx.x;
}
(this is for kernels with flattened, 1-D buffer accesses)
The offset changes dynamically before each kernel launch, so I guess I can't use a #define for it.
The CUDA driver API has this for launching a kernel:
CUresult cuLaunchKernel ( CUfunction f, unsigned int gridDimX,
unsigned int gridDimY, unsigned int gridDimZ,
unsigned int blockDimX, unsigned int blockDimY,
unsigned int blockDimZ, unsigned int sharedMemBytes,
CUstream hStream, void** kernelParams, void** extra )
There is no equivalent functionality in CUDA for this, but you don't really need it either.
In the standard CUDA model you cannot launch kernels on multiple devices with a single API call (and you technically can't allocate memory on multiple devices with a single API call either). Each GPU is effectively independent.
NVIDIA has introduced a multi-device launch and synchronization mechanism as part of the cooperative groups features in CUDA 9, but that is entirely different from what you are asking about.
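What your question sketches is indeed the usual pattern: pass the per-GPU offset as an ordinary kernel argument and add it to the computed index yourself. A minimal sketch of one way to do it, passing the offset by value instead of through a device buffer (the kernel, buffer names and sizes here are hypothetical, and the per-GPU context/module/function setup for the driver API is omitted):

// kernel side: take the offset by value instead of through a buffer
extern "C" __global__ void vecAdd(const float *a, const float *b, float *c,
                                  int n, int offset)
{
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// host side (driver API), once per GPU; vecAddFunc comes from
// cuModuleGetFunction on the module containing the kernel above
int offset = k + 1;                                // this GPU's first element
int n      = N;
void *params[] = { &dA, &dB, &dC, &n, &offset };   // dA/dB/dC are CUdeviceptr
unsigned int threads = 256;
unsigned int blocks  = (n - offset + threads - 1) / threads;
cuLaunchKernel(vecAddFunc,
               blocks, 1, 1,        // grid dims
               threads, 1, 1,       // block dims
               0, NULL,             // shared memory bytes, stream
               params, NULL);       // kernelParams, extra

Because the offset is just another kernel argument read through kernelParams, it can change on every launch without touching the kernel source, which also answers the #define concern.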