Reading global flag does not work for CPU>GPU data exchange in CUDA

I try to do a simple uni-directional communication between a CPU and a K80 GPU using CUDA. I want to have a bool cancel flag that resides in global memory and is polled by all running GPU/kernel threads. The flag should default to false and can be set by a CPU/host thread to true during ongoing computation. The GPU/kernel threads then should exit.

This is what I tried. I have simplified code. I removed error checking and application logic (including the application logic that prevents concurrent access to cancelRequested).

On the host side, global definition (.cpp):

// Host side thread safety of this pointer is covered by application logic
volatile bool* cancelRequested = nullptr;

On the host side in the compute thread (.cpp):

initialize(&cancelRequested);
compute(cancelRequested);
finalize(&cancelRequested);

On the host side in a main thread (.cpp):

cancel(cancelRequested); // Called after init is finished

Host routines (.cu file):

void initialize(volatile bool** pCancelRequested)
{
   cudaMalloc(const_cast<bool**>(pCancelRequested), sizeof(bool));
   const bool aFalse = false;
   cudaMemcpy(*const_cast<bool**>(pCancelRequested), &aFalse, sizeof(bool), cudaMemcpyHostToDevice);
}

void compute(volatile bool* pCancelRequested) 
{
   ....
   computeKernel<<<pBlocksPerGPU, aThreadsPerBlock>>>(pCancelRequested);
   cudaDeviceSynchronize(); // Non-busy wait
   ....
}

void finalize(volatile bool** pCancelRequested)
{
   cudaFree(*const_cast<bool**>(pCancelRequested));
   *pCancelRequested = nullptr;
}

void cancel(volatile bool* pCancelRequested)
{
   const bool aTrue = true;
   cudaMemcpy(const_cast<bool*>(pCancelRequested), &aTrue, sizeof(bool), cudaMemcpyHostToDevice);
}

Device routines (.cu file):

__global__ void computeKernel(volatile bool* pCancelRequested)
{
   while (someCondition)
   {
      // Computation step here

      if (*pCancelRequested)
      {
         printf("-> Cancel requested!\n");
         return;
      }
   }
}

The code runs fine. But it does never enter the cancel case. I read back the false and true values in initialize() and cancel() successfully and checked them using gdb. I.e. writing to the global flag works fine, at least from host side view point. However the kernels never see the cancel flag set to true and exit normally from the outer while loop.

Any idea why this doesn't work?

Solution

The fundamental problem I see with your approach is that cuda streams will prevent it from working.

CUDA streams have two basic principles:

Items issued into the same stream will not overlap; they will serialize.
Items issued into separate created streams have the possibility to overlap; there is no defined ordering of those operations provided by CUDA.

Even if you don't explicitly use streams, you are operating in the "default stream" and the same stream semantics apply.

I'm not covering everything there is to know about streams in this brief summary. You can learn more about CUDA streams in unit 7 of this online training series

Because of CUDA streams, this:

 computeKernel<<<pBlocksPerGPU, aThreadsPerBlock>>>(pCancelRequested);

and this:

 cudaMemcpy(const_cast<bool*>(pCancelRequested), &aTrue, sizeof(bool), cudaMemcpyHostToDevice);

could not possibly overlap (they are being issued into the same "default" CUDA stream, and so rule 1 above says that they cannot possibly overlap). But overlap is essential if you want to "signal" the running kernel. We must allow the cudaMemcpy operation to take place at the same time that the kernel is running.

We can fix this via a direct application of CUDA streams (taking note of rule 2 above), to put the copy operation and the compute (kernel) operation into separate created streams, so as to allow them to overlap. When we do that, things work as desired:

$ cat t2184.cu
#include <iostream>
#include <unistd.h>

__global__ void k(volatile int *flag){

  while (*flag != 0);
}

int main(){

  int *flag, *h_flag = new int;
  cudaStream_t s[2];
  cudaStreamCreate(s+0);
  cudaStreamCreate(s+1);
  cudaMalloc(&flag, sizeof(h_flag[0]));
  *h_flag = 1;
  cudaMemcpy(flag, h_flag, sizeof(h_flag[0]), cudaMemcpyHostToDevice);
  k<<<32, 256, 0, s[0]>>>(flag);
  sleep(5);
  *h_flag = 0;
  cudaMemcpyAsync(flag, h_flag, sizeof(h_flag[0]), cudaMemcpyHostToDevice, s[1]);
  cudaDeviceSynchronize();
}

$ nvcc -o t2184 t2184.cu
$ compute-sanitizer ./t2184
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$

NOTES:

Although not evident from the static text printout, the program spends approximately 5 seconds before exiting. If you comment out a line such as *h_flag = 0; then the program will hang, indicating that the flag signal method is working correctly.
Note the use of volatile. This is necessary to instruct the compiler that any access to that data must be an actual access, the compiler is not allowed to make modifications that would prevent a memory read or write from occurring at the expected location.

This kind of host->device signal behavior can also be realized without explicit use of streams, but with host pinned memory as the signalling location, since it is "visible" to both host and device code, "simultaneously". Here is an example:

#include <iostream>
#include <unistd.h>

__global__ void k(volatile int *flag){

  while (*flag != 0);
}

int main(){

  int *flag;
  cudaHostAlloc(&flag, sizeof(flag[0]), cudaHostAllocDefault);
  *flag = 1;
  k<<<32, 256>>>(flag);
  sleep(5);
  *flag = 0;
  cudaDeviceSynchronize();
}

For other examples of signalling, such as from device to host, other readers may be interested in this.