Tags: c++, gpu, hardware, vulkan

What actually is a queue family in Vulkan?


I am currently learning Vulkan. Right now I am just taking apart each command and inspecting the structures to try to understand what they mean.

Right now I am analyzing QueueFamilies, for which I have the following code:

#include <iostream>
#include <vector>
#include <vulkan/vulkan.hpp>

// 'device' is a vk::PhysicalDevice picked earlier.
std::vector<vk::QueueFamilyProperties> queue_families = device.getQueueFamilyProperties();
for (auto &q_family : queue_families)
{
    std::cout << "Queue number: " << q_family.queueCount << std::endl;
    std::cout << "Queue flags: " << vk::to_string(q_family.queueFlags) << std::endl;
}

This produces the following output:

Queue number: 16
Queue flags: {Graphics | Compute | Transfer | SparseBinding}
Queue number: 1
Queue flags: {Transfer}
Queue number: 8
Queue flags: {Compute}

So, naively I understand it like this:

There are 3 Queue families, one queue family has 16 queues, all capable of graphics, compute, transfer, and sparse binding operations (no idea what the last 2 are)

Another has 1 queue, capable only of transfer (whatever that is)

The final one has 8 queues capable of compute operations.

What is each queue family? I understand it's where we send execution commands like drawing and swapping buffers, but this is a somewhat broad explanation; I would like a more detailed answer.

What are the 2 extra flags, Transfer and SparseBinding?

And finally, why do we have/need multiple command queues?


Solution

  • To understand queue families, you first have to understand queues.

    A queue is something you submit command buffers to, and command buffers submitted to a queue are executed in order[1] relative to each other. Command buffers submitted to different queues are unordered relative to each other unless you explicitly synchronize them with a VkSemaphore. You can only submit work to a queue from one thread at a time, but different threads can submit work to different queues simultaneously.
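
    For instance, here is a minimal sketch of that cross-queue synchronization in the C++ bindings, assuming two queues, two recorded command buffers, and a vk::Semaphore were created earlier (all names here are placeholders, not Vulkan API):

        // Submit cmdA to queueA and signal 'semaphore' when it completes.
        vk::SubmitInfo submit_a;
        submit_a.commandBufferCount   = 1;
        submit_a.pCommandBuffers      = &cmdA;
        submit_a.signalSemaphoreCount = 1;
        submit_a.pSignalSemaphores    = &semaphore;
        queueA.submit(submit_a);

        // Submit cmdB to queueB, but make it wait on 'semaphore' first, so
        // it cannot start executing until the first submission signals it.
        vk::PipelineStageFlags wait_stage = vk::PipelineStageFlagBits::eTopOfPipe;
        vk::SubmitInfo submit_b;
        submit_b.waitSemaphoreCount = 1;
        submit_b.pWaitSemaphores    = &semaphore;
        submit_b.pWaitDstStageMask  = &wait_stage;
        submit_b.commandBufferCount = 1;
        submit_b.pCommandBuffers    = &cmdB;
        queueB.submit(submit_b);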

    Each queue can only perform certain kinds of operations. Graphics queues can run graphics pipelines started by vkCmdDraw* commands. Compute queues can run compute pipelines started by vkCmdDispatch*. Transfer queues can perform transfer (copy) operations from vkCmdCopy*. Sparse binding queues can change the binding of sparse resources to memory with vkQueueBindSparse (note this is an operation submitted directly to a queue, not a command in a command buffer). Some queues can perform multiple kinds of operations. In the spec, every command that can be submitted to a queue has a "Command Properties" table that lists which queue types can execute the command.
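
    For example, a copy recorded for a transfer-capable queue might look like this sketch, where 'cmd' is a vk::CommandBuffer allocated from a command pool created for that queue's family, and 'src', 'dst', and 'size' are assumed to exist:

        // Record a buffer-to-buffer copy; only transfer capability is needed
        // to execute this, so it could run on a transfer-only queue family.
        cmd.begin(vk::CommandBufferBeginInfo{});
        cmd.copyBuffer(src, dst, vk::BufferCopy{0 /*srcOffset*/, 0 /*dstOffset*/, size});
        cmd.end();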

    A queue family just describes a set of queues with identical properties. So in your example, the device supports three kinds of queues (a sketch of searching for families by these flags follows the list):

    • One kind can do graphics, compute, transfer, and sparse binding operations, and you can create up to 16 queues of that type.

    • Another kind can only do transfer operations, and you can only create one queue of this kind. Usually, this is for asynchronously DMAing data between host and device memory on discrete GPUs, so transfers can be done concurrently with independent graphics/compute operations.

    • Finally, you can create up to 8 queues that are only capable of compute operations.
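
    As a sketch of how you might search for a family with a given set of flags (the helper is mine, not part of the Vulkan API):

        #include <optional>

        // Return the index of the first queue family whose flags contain
        // every bit in 'required', or nullopt if there is none.
        std::optional<uint32_t> find_queue_family(vk::PhysicalDevice physical_device,
                                                  vk::QueueFlags required)
        {
            auto families = physical_device.getQueueFamilyProperties();
            for (uint32_t i = 0; i < families.size(); ++i)
                if ((families[i].queueFlags & required) == required)
                    return i;
            return std::nullopt;
        }

    Note that on the device above, asking for Compute this way returns family 0 (which also has Graphics); to find the dedicated compute family you would additionally reject families whose flags include Graphics.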

    Some queues might only correspond to separate queues in the host-side scheduler, while others might correspond to actual independent queues in hardware. For example, many GPUs only have one hardware graphics queue, so even if you create two VkQueues from a graphics-capable queue family, command buffers submitted to those queues will progress through the kernel driver's command buffer scheduler independently but will execute in some serial order on the GPU. But some GPUs have multiple compute-only hardware queues, so two VkQueues for a compute-only queue family might actually proceed independently and concurrently all the way through the GPU. Vulkan doesn't expose this distinction.

    The bottom line: decide how many queues you can usefully use, based on how much concurrency you have. For many apps, a single "universal" queue is all they need. More advanced ones might have one graphics+compute queue, a separate compute-only queue for asynchronous compute work, and a transfer queue for asynchronous DMA. Then map what you'd like onto what's available; you may need to do your own multiplexing. For example, on a device that doesn't have a compute-only queue family, you might create multiple graphics+compute queues instead, or serialize your async compute jobs onto your single graphics+compute queue yourself.
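
    As a sketch of that mapping when creating the logical device, assuming family indices found with something like the helper above (variable names are placeholders):

        float priority = 1.0f;
        std::vector<vk::DeviceQueueCreateInfo> queue_infos;
        // One graphics+compute queue from the "universal" family.
        queue_infos.push_back({{}, graphics_family, 1, &priority});
        if (compute_family != graphics_family)
            // A dedicated compute-only family exists: request an async compute queue.
            queue_infos.push_back({{}, compute_family, 1, &priority});
        // Otherwise, serialize async compute onto the graphics+compute queue yourself.

        vk::DeviceCreateInfo device_info;
        device_info.queueCreateInfoCount = static_cast<uint32_t>(queue_infos.size());
        device_info.pQueueCreateInfos    = queue_infos.data();
        vk::Device device = physical_device.createDevice(device_info);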

    [1] Oversimplifying a bit. They start in order, but are allowed to proceed independently after that and complete out of order. Independent progress of different queues is not guaranteed though. I'll leave it at that for this question.