Search code examples
c++memorycudatransfercuda-streams

CUDA C++ overlapping SERIAL kernel execution and data transfer


So this guide here shows the general way to overlap kernel execution and data transfer.

cudaStream_t streams[nStreams];
for (int i = 0; i < nStreams; ++i) {
  cudaStreamCreate(&streams[i]);
  int offset = ...;
  cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, stream[i]);
  kernel<<<streamSize/blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
  // edit: no deviceToHost copy
}

However, the kernel is serial. So it must process 0->1000, then 1000->2000, ... In short, the order to correctly perform this kernel while overlapping data transfer is:

  • copy[a->b] must happen before kernel[a->b]
  • kernel [a->b] must happen before kernel[b->c], where c > a, b

Is it possible to do this without using cudaDeviceSynchronize() ? If not, what's the fastest way to do it?


Solution

  • So each kernel is dependent on (cannot begin until):

    1. The associated H->D copy is complete
    2. The previous kernel execution is complete

    Ordinary stream semantics won't handle this case (2 separate dependencies, from 2 separate streams), so we'll need to put an extra interlock in there. We can use a set of events and cudaStreamWaitEvent() to handle it.

    For the most general case (no knowledge of the total number of chunks) I would recommend something like this:

    $ cat t1783.cu
    #include <iostream>
    #include <time.h>
    #include <sys/time.h>
    #define USECPSEC 1000000ULL
    
    unsigned long long dtime_usec(unsigned long long start){
    
      timeval tv;
      gettimeofday(&tv, 0);
      return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
    }
    
    template <typename T>
    __global__ void process(const T * __restrict__ in, const T * __restrict__ prev, T * __restrict__ out, size_t ds){
    
      for (size_t i = threadIdx.x+blockDim.x*blockIdx.x; i < ds; i += gridDim.x*blockDim.x){
        out[i] = in[i] + prev[i];
        }
    }
    const int nTPB = 256;
    typedef int mt;
    const int chunk_size = 1048576;
    const int data_size = 10*1048576;
    const int ns = 3;
    
    int main(){
    
      mt *din, *dout, *hin, *hout;
      cudaStream_t str[ns];
      cudaEvent_t  evt[ns];
      for (int i = 0; i < ns; i++) {
        cudaStreamCreate(str+i);
        cudaEventCreate( evt+i);}
      cudaMalloc(&din, sizeof(mt)*data_size);
      cudaMalloc(&dout, sizeof(mt)*data_size);
      cudaHostAlloc(&hin,  sizeof(mt)*data_size, cudaHostAllocDefault);
      cudaHostAlloc(&hout, sizeof(mt)*data_size, cudaHostAllocDefault);
      cudaMemset(dout, 0, sizeof(mt)*chunk_size);  // for first loop iteration
      for (int i = 0; i < data_size; i++) hin[i] = 1;
      cudaEventRecord(evt[ns-1], str[ns-1]); // this event will immediately "complete"
      unsigned long long dt = dtime_usec(0);
      for (int i = 0; i < (data_size/chunk_size); i++){
        cudaStreamSynchronize(str[i%ns]); // so we can reuse event safely
        cudaMemcpyAsync(din+i*chunk_size, hin+i*chunk_size, sizeof(mt)*chunk_size, cudaMemcpyHostToDevice, str[i%ns]);
        cudaStreamWaitEvent(str[i%ns], evt[(i>0)?(i-1)%ns:ns-1], 0);
        process<<<(chunk_size+nTPB-1)/nTPB, nTPB, 0, str[i%ns]>>>(din+i*chunk_size, dout+((i>0)?(i-1)*chunk_size:0), dout+i*chunk_size, chunk_size);
        cudaEventRecord(evt[i%ns]);
        cudaMemcpyAsync(hout+i*chunk_size, dout+i*chunk_size, sizeof(mt)*chunk_size, cudaMemcpyDeviceToHost, str[i%ns]);
        }
      cudaDeviceSynchronize();
      dt = dtime_usec(dt);
      for (int i = 0; i < data_size; i++) if (hout[i] != (i/chunk_size)+1) {std::cout << "error at index: " << i << " was: " << hout[i] << " should be: " << (i/chunk_size)+1 << std::endl; return 0;}
      std::cout << "elapsed time: " << dt << " microseconds" << std::endl;
    }
    $ nvcc -o t1783 t1783.cu
    $ ./t1783
    elapsed time: 4366 microseconds
    

    Good practice here would be to use a profiler to verify the expected overlap scenarios. However, we can take a shortcut based on the elapsed time measurement.

    The loop is transferring a total of 40MB of data to the device, and 40MB back. The elapsed time is 4366us. This gives an average throughput for each direction of (40*1048576)/4366 or 9606 bytes/us which is 9.6GB/s. This is basically saturating the Gen3 link in both directions, therefore my chunk processing is approximately back-to-back, and I have essentially full overlap of D->H with H->D memcopies. The kernel here is trivial so it shows up as just slivers in the profile.

    For your case, you indicated you didn't need the D->H copy, but it adds no extra complexity so I chose to show it. The desired behavior still occurs if you comment that line out of the loop (although this affects results checking later).

    A possible criticism of this approach is that the cudaStreamSynchronize() call, which is necessary so we don't "overrun" the event interlock, means that the loop will only proceed to ns number of iterations beyond the one that is currently executing on the device. So it is not possible to launch more work asynchronously than that. If you wanted to launch all the work at once and go on and do something else on the CPU, this method will not fully allow that (the CPU will proceed past the loop when the stream processing has reach ns iterations from the last one).

    The code is presented to illustrate an approach, conceptually. It is not guaranteed to be defect free, nor do I claim it is suitable for any particular purpose.