As mentioned here: How to reduce CUDA synchronize latency / delay

There are two approaches for waiting for a result from the device:
For "Polling" I need to use cudaDeviceScheduleSpin. But for "Blocking", what do I need to use: cudaDeviceScheduleYield or cudaDeviceScheduleBlockingSync?

What is the difference between cudaDeviceScheduleBlockingSync and cudaDeviceScheduleYield?
cudaDeviceScheduleYield is described here: http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__DEVICE_g18074e885b4d89f5a0fe1beab589e0c8.html as: "Instruct CUDA to yield its thread when waiting for results from the device. This can increase latency when waiting for the device, but can increase the performance of CPU threads performing work in parallel with the device." That is, it waits for the result without burning the CPU in a spin loop, i.e. it is "Blocking". But cudaDeviceScheduleBlockingSync also waits for the result without burning the CPU in a spin loop. So what is the difference?
To my understanding, both approaches use polling to synchronize. In pseudo-code, cudaDeviceScheduleSpin is:
while (!IsCudaJobDone())
{
}
whereas cudaDeviceScheduleYield is:
while (!IsCudaJobDone())
{
Thread.Yield();
}
i.e. cudaDeviceScheduleYield tells the operating system that it may interrupt the polling thread and schedule another thread that has work to do. This increases the performance of other CPU threads, but it also increases latency if the CUDA job finishes at a moment when a thread other than the polling one is active.
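For contrast, a rough sketch of what I would expect cudaDeviceScheduleBlockingSync to do (the names below are hypothetical, not the real driver internals): the waiting thread does not poll at all; it sleeps on an OS synchronization primitive and is woken by the driver when the device finishes.

// Hypothetical sketch, not actual CUDA driver code:
// the thread is descheduled and consumes no CPU time until it is signaled.
WaitOnDriverEvent(cudaJobDoneEvent);   // hypothetical blocking primitive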