I am trying to do simple uni-directional communication between a CPU and a K80 GPU using CUDA. I want to have a bool cancel flag that resides in global memory and is polled by all running GPU/kernel threads. The flag should default to false and be settable to true by a CPU/host thread during ongoing computation; the GPU/kernel threads should then exit.
This is what I tried. I have simplified the code, removing error checking and application logic (including the application logic that prevents concurrent access to cancelRequested).
On the host side, global definition (.cpp):
// Host side thread safety of this pointer is covered by application logic
volatile bool* cancelRequested = nullptr;
On the host side in the compute thread (.cpp):
initialize(&cancelRequested);
compute(cancelRequested);
finalize(&cancelRequested);
On the host side in a main thread (.cpp):
cancel(cancelRequested); // Called after init is finished
Host routines (.cu file):
void initialize(volatile bool** pCancelRequested)
{
    cudaMalloc(const_cast<bool**>(pCancelRequested), sizeof(bool));
    const bool aFalse = false;
    cudaMemcpy(*const_cast<bool**>(pCancelRequested), &aFalse, sizeof(bool), cudaMemcpyHostToDevice);
}

void compute(volatile bool* pCancelRequested)
{
    ....
    computeKernel<<<pBlocksPerGPU, aThreadsPerBlock>>>(pCancelRequested);
    cudaDeviceSynchronize(); // Non-busy wait
    ....
}

void finalize(volatile bool** pCancelRequested)
{
    cudaFree(*const_cast<bool**>(pCancelRequested));
    *pCancelRequested = nullptr;
}

void cancel(volatile bool* pCancelRequested)
{
    const bool aTrue = true;
    cudaMemcpy(const_cast<bool*>(pCancelRequested), &aTrue, sizeof(bool), cudaMemcpyHostToDevice);
}
Device routines (.cu file):
__global__ void computeKernel(volatile bool* pCancelRequested)
{
    while (someCondition)
    {
        // Computation step here
        if (*pCancelRequested)
        {
            printf("-> Cancel requested!\n");
            return;
        }
    }
}
The code runs fine, but it never enters the cancel case. I read back the false and true values in initialize() and cancel() successfully and checked them using gdb, i.e. writing to the global flag works fine, at least from the host-side viewpoint. However, the kernels never see the cancel flag set to true and exit normally from the outer while loop.
Any idea why this doesn't work?
The fundamental problem I see with your approach is that CUDA streams will prevent it from working.
CUDA streams have two basic principles:
1. Items issued into the same stream will not overlap; they serialize. An item issued into a stream will not begin until the previously issued item in that stream has completed.
2. Items issued into separate created streams have no ordering imposed on them by CUDA, so they have the possibility to overlap.
Even if you don't explicitly use streams, you are operating in the "default stream" and the same stream semantics apply.
I'm not covering everything there is to know about streams in this brief summary; you can learn more about CUDA streams in unit 7 of this online training series.
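A minimal sketch of these two rules (the kernel, stream names, and launch configuration here are my own, not taken from the question; the actual overlap would be confirmed with a profiler such as Nsight Systems):

#include <cstdio>

// A kernel that simply burns roughly `cycles` device clocks
__global__ void spin(long long cycles)
{
    const long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Rule 1: both launches go into s1, so the second one cannot begin
    // until the first one has finished.
    spin<<<1, 1, 0, s1>>>(100000000);
    spin<<<1, 1, 0, s1>>>(100000000);

    // Rule 2: this launch goes into a different created stream, so it
    // may overlap with the work already queued in s1.
    spin<<<1, 1, 0, s2>>>(100000000);

    cudaDeviceSynchronize();
    printf("done\n");
    return 0;
}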
Because of CUDA streams, this:
computeKernel<<<pBlocksPerGPU, aThreadsPerBlock>>>(pCancelRequested);
and this:
cudaMemcpy(const_cast<bool*>(pCancelRequested), &aTrue, sizeof(bool), cudaMemcpyHostToDevice);
could not possibly overlap: they are issued into the same "default" CUDA stream, so rule 1 above says the copy will not begin until the kernel has completed. But overlap is essential if you want to "signal" the running kernel; we must allow the cudaMemcpy operation to take place at the same time that the kernel is running.
We can fix this via a direct application of CUDA streams (taking note of rule 2 above), to put the copy operation and the compute (kernel) operation into separate created streams, so as to allow them to overlap. When we do that, things work as desired:
$ cat t2184.cu
#include <iostream>
#include <unistd.h>

// Kernel spins until the flag in device global memory is cleared by the host
__global__ void k(volatile int *flag){
    while (*flag != 0);
}

int main(){
    int *flag, *h_flag = new int;
    cudaStream_t s[2];
    cudaStreamCreate(s+0);
    cudaStreamCreate(s+1);
    cudaMalloc(&flag, sizeof(h_flag[0]));
    *h_flag = 1;
    cudaMemcpy(flag, h_flag, sizeof(h_flag[0]), cudaMemcpyHostToDevice); // set the flag before launch
    k<<<32, 256, 0, s[0]>>>(flag);  // kernel runs in created stream s[0]
    sleep(5);                       // let the kernel spin for a while
    *h_flag = 0;
    cudaMemcpyAsync(flag, h_flag, sizeof(h_flag[0]), cudaMemcpyHostToDevice, s[1]); // signal in s[1], overlapping the kernel
    cudaDeviceSynchronize();        // returns once the kernel observes the cleared flag and exits
}
$ nvcc -o t2184 t2184.cu
$ compute-sanitizer ./t2184
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$
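Applied to the code in your question, the same pattern would look roughly like this (a sketch only: it reuses your kernel, launch parameters, and function names, assumes the two streams are created once during initialization and passed in by the caller, and omits error checking):

void compute(volatile bool* pCancelRequested, cudaStream_t computeStream)
{
    ....
    computeKernel<<<pBlocksPerGPU, aThreadsPerBlock, 0, computeStream>>>(pCancelRequested);
    cudaStreamSynchronize(computeStream);   // non-busy wait for the kernel only
    ....
}

void cancel(volatile bool* pCancelRequested, cudaStream_t cancelStream)
{
    const bool aTrue = true;
    // Issued into its own created stream, so it can execute while the kernel is running
    cudaMemcpyAsync(const_cast<bool*>(pCancelRequested), &aTrue, sizeof(bool),
                    cudaMemcpyHostToDevice, cancelStream);
    cudaStreamSynchronize(cancelStream);    // ensure the copy is done before &aTrue goes out of scope
}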
NOTES:
- If you comment out the line *h_flag = 0; then the program will hang, indicating that the flag signal method is working correctly.
- Note the use of volatile to decorate the flag. This is necessary to instruct the compiler that any access to that data must be an actual access; the compiler is not allowed to make modifications that would prevent a memory read or write from occurring at the expected location. (A short illustration of the difference follows the next example.)
- This kind of host->device signal behavior can also be realized without explicit use of streams, but with host pinned memory as the signalling location, since it is "visible" to both host and device code, "simultaneously". Here is an example:
#include <iostream>
#include <unistd.h>

// Kernel spins until the flag in pinned host memory is cleared by the host
__global__ void k(volatile int *flag){
    while (*flag != 0);
}

int main(){
    int *flag;
    cudaHostAlloc(&flag, sizeof(flag[0]), cudaHostAllocDefault); // pinned host memory, visible to host and device
    *flag = 1;
    k<<<32, 256>>>(flag);
    sleep(5);                 // let the kernel spin for a while
    *flag = 0;                // ordinary host store; the spinning kernel observes it
    cudaDeviceSynchronize();
}
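As a short illustration of the volatile note above (kernel names here are made up): without the qualifier the compiler is free to load the flag once, keep it in a register, and never re-read it, so the loop could never observe the host's update.

// Without volatile, the compiler may cache *flag in a register,
// so this loop might never see the host clearing the flag:
__global__ void k_broken(int *flag)
{
    while (*flag != 0);
}

// With volatile, every iteration performs an actual read of the flag:
__global__ void k_ok(volatile int *flag)
{
    while (*flag != 0);
}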
For other examples of signalling, such as from device to host, other readers may be interested in this.
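As a rough sketch of the device->host direction (my own example, not taken from the linked material), the same pinned-memory idea works in reverse: the kernel raises a flag that the host polls while the kernel is still running.

#include <cstdio>

// Device -> host signal: one thread raises a flag in pinned host memory,
// then the kernel keeps "working" for a while so the host can observe the
// signal while the kernel is still running.
__global__ void worker(volatile int *done)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *done = 1;                        // raise the signal, visible to the host

    const long long start = clock64();    // stand-in for the rest of the computation
    while (clock64() - start < 1000000000LL) { }
}

int main()
{
    volatile int *done;
    cudaHostAlloc((void **)&done, sizeof(int), cudaHostAllocDefault); // pinned, visible to host and device
    *done = 0;
    worker<<<32, 256>>>(done);            // kernel launch is asynchronous
    while (*done == 0) { }                // host polls until the kernel raises the flag
    printf("signal received from device, kernel still running\n");
    cudaDeviceSynchronize();              // now wait for the kernel itself to finish
    return 0;
}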