
In OpenCL, what's the difference between "host" and "device" command-queues?


In OpenCL, when creating a command queue, we can set options to indicate that this will be a "device" command queue; otherwise, it's a "host" queue. The C++ bindings have a CommandQueue class and a DeviceCommandQueue class, but it seems most of the enqueueXYZ() methods are on CommandQueue.

Coming from CUDA programming, I only know of a single kind of command queue, the "CUDA stream", which can be synchronous or asynchronous (also an option for clCreateCommandQueueWithProperties) - but there is no "host stream".

So, what are "device" command queues and "host" (or "non-device") command queues in OpenCL? That is: what should I use each of them for, and how do they interact (if at all)?

Note: Assume OpenCL 3.0 with typical extensions if that's relevant (although it probably shouldn't be)


Solution

  • In OpenCL, a device queue is the dynamic-parallelism queue: it is filled by the GPU and consumed by the GPU, letting a running kernel launch a data-dependent, flexible amount of child work without any CPU interaction.
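
    For example, here is a minimal sketch of creating a default on-device queue and a kernel that enqueues a child kernel. Names like ctx, dev and the kernel bodies are placeholders, and it assumes the device supports device-side enqueue (an OpenCL 2.x feature that is optional in 3.0):

        /* Host side: a device queue must be out-of-order, and marking it
           CL_QUEUE_ON_DEVICE_DEFAULT makes it reachable from kernels via
           get_default_queue(). */
        cl_queue_properties props[] = {
            CL_QUEUE_PROPERTIES,
            CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
                CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT,
            CL_QUEUE_SIZE, 16 * 1024,  /* bytes reserved for device-side commands */
            0
        };
        cl_int err;
        cl_command_queue dev_q =
            clCreateCommandQueueWithProperties(ctx, dev, props, &err);

        /* Kernel side (OpenCL C 2.0): the GPU itself decides the child
           launch size, with no round trip to the CPU. */
        kernel void parent(global int *data) {
            int m = data[0];                       /* data-dependent work size */
            enqueue_kernel(get_default_queue(),
                           CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                           ndrange_1D(m),
                           ^{ /* child work on data */ });
        }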

    Using multiple host command queues with enqueueXYZ() is similar to using multiple CUDA streams. But to avoid implementation-specific data-sharing / race-condition issues, it is safest to give each concurrent kernel call its own kernel object when the calls use different parameters. The underlying difference: CUDA's cudaLaunchKernel sends all launch state to the GPU in one go, so a launch is stateless and depends only on the launch command itself, whereas an OpenCL launch uses state stored in the kernel object (mutated by clSetKernelArg, which does not go through any command queue).
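
    A sketch of that pattern (k, queue1, queue2, bufA, bufB and gsz are assumed names): since clSetKernelArg mutates state held in the cl_kernel object and is not thread-safe for the same kernel object, each concurrent invocation gets its own kernel:

        cl_int err;
        /* OpenCL 2.1+ offers clCloneKernel; on older versions, call
           clCreateKernel on the same program a second time instead. */
        cl_kernel k2 = clCloneKernel(k, &err);
        clSetKernelArg(k,  0, sizeof(cl_mem), &bufA);
        clSetKernelArg(k2, 0, sizeof(cl_mem), &bufB);
        clEnqueueNDRangeKernel(queue1, k,  1, NULL, &gsz, NULL, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue2, k2, 1, NULL, &gsz, NULL, 0, NULL, NULL);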

    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE means a queue created with this property imposes no ordering on its enqueued commands, which is why device-side queues require it by default. The GPU runs each command whenever there is hardware capacity free for it. As long as the commands do not collide with each other (e.g. through race conditions on shared data), this is expected to work without issue. Host queues can be used this way too.
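
    For instance, a sketch of an out-of-order host queue (ctx, dev, producer, consumer and gsz are placeholders): independent commands may overlap, and any ordering that matters is expressed through events:

        cl_queue_properties props[] = {
            CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
        };
        cl_int err;
        cl_command_queue q =
            clCreateCommandQueueWithProperties(ctx, dev, props, &err);

        cl_event produced;
        clEnqueueNDRangeKernel(q, producer, 1, NULL, &gsz, NULL,
                               0, NULL, &produced);
        /* consumer reads what producer wrote, so it waits on the event;
           unrelated kernels enqueued on q could run concurrently. */
        clEnqueueNDRangeKernel(q, consumer, 1, NULL, &gsz, NULL,
                               1, &produced, NULL);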

    In an in-order host queue, by contrast, commands run one by one on the GPU: asynchronously with respect to the CPU, but strictly ordered among themselves.
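
    A minimal sketch (q, buf, k, src, dst, bytes and gsz are assumed names): no events are needed between the commands below; they execute back-to-back on the GPU while the host stays free until clFinish:

        clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, bytes, src, 0, NULL, NULL);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_FALSE, 0, bytes, dst, 0, NULL, NULL);
        clFinish(q);  /* host blocks here; everything above already ran in order */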

    Host command queues can use events and markers to make commands partially dependent on each other, building dependency graphs, but without the ability to replay them (which CUDA has with CUDA Graphs).
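
    A sketch of such a cross-queue dependency (queueA, queueB, kA, kB and gsz are placeholders): clEnqueueMarkerWithWaitList (OpenCL 1.2+) inserts a checkpoint so queueB waits for work on queueA, forming a small graph. Unlike a CUDA Graph, this has to be re-enqueued each time; there is no captured, replayable object:

        cl_event stepA;
        clEnqueueNDRangeKernel(queueA, kA, 1, NULL, &gsz, NULL, 0, NULL, &stepA);
        clEnqueueMarkerWithWaitList(queueB, 1, &stepA, NULL); /* B now depends on A */
        clEnqueueNDRangeKernel(queueB, kB, 1, NULL, &gsz, NULL, 0, NULL, NULL);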