As mentioned here: How to reduce CUDA synchronize latency / delay

There are two approaches for waiting for a result from the device:
For "Polling" I need to use cudaDeviceScheduleSpin. But for "Blocking", what do I need to use: cudaDeviceScheduleYield or cudaDeviceScheduleBlockingSync?

What is the difference between cudaDeviceScheduleBlockingSync and cudaDeviceScheduleYield?
cudaDeviceScheduleYield is described here: http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__DEVICE_g18074e885b4d89f5a0fe1beab589e0c8.html as: "Instruct CUDA to yield its thread when waiting for results from the device. This can increase latency when waiting for the device, but can increase the performance of CPU threads performing work in parallel with the device." That is, it waits for the result without burning the CPU in a spin loop, i.e. it is "Blocking". But cudaDeviceScheduleBlockingSync also waits for the result without burning the CPU in a spin loop. So what is the difference?
To my understanding, both approaches use polling to synchronize. In pseudo-code, cudaDeviceScheduleSpin is:
while (!IsCudaJobDone())
{
}
whereas cudaDeviceScheduleYield is:
while (!IsCudaJobDone())
{
Thread.Yield();
}
i.e. cudaDeviceScheduleYield tells the operating system that it may interrupt the polling thread and schedule another thread that has work to do. This increases the performance of other CPU threads, but it also increases latency if the CUDA job finishes at a moment when a thread other than the polling one is active.
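For contrast, a rough sketch of what I would expect cudaDeviceScheduleBlockingSync to do (the names below are hypothetical, not the real driver internals): the waiting thread does not poll at all; it sleeps on an OS synchronization primitive and is woken by the driver when the device finishes.

// Hypothetical sketch, not actual CUDA driver code:
// the thread is descheduled and consumes no CPU time until it is signaled.
WaitOnDriverEvent(cudaJobDoneEvent);   // hypothetical blocking primitive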