
In Vulkan (or any other modern graphics API), should fences be waited per queue submission or per frame?


I am trying to set up my renderer so that rendering always renders into a texture; I then present whichever texture I like, as long as its format is swapchain-compatible. This means I need to deal with one graphics queue (I don't have compute yet) that renders the scene, UI, etc.; one transfer queue that copies the rendered image into the swapchain; and one present queue for presenting the swapchain. This is the use case I am tackling at the moment, but I will have more like it (e.g. compute queues) as my renderer matures.

Here is pseudocode for what I am trying to achieve, with some of my own assumptions mixed in:

// wait for fences per frame
waitForFences(fences[currentFrame]);
resetFences(fences[currentFrame]);

// 1. Rendering (queue = Graphics)
commandBuffer.begin();
renderEverything();
commandBuffer.end();

QueueSubmitInfo renderSubmit{};
renderSubmit.commandBuffer = commandBuffer;

// Nothing to wait for
renderSubmit.waitSemaphores = nullptr;

// Signal that rendering is complete
renderSubmit.signalSemaphores = { renderSemaphores[currentFrame] };

// Do not signal the fence yet
queueSubmit(renderSubmit, nullptr);

// 2. Transferring to swapchain (queue = Transfer)

// acquire the image that we want to copy into
// and signal that it is available
swapchain.acquireNextImage(imageAvailableSemaphore[currentFrame]);

commandBuffer.begin();
copyTexture(textureToPresent, swapchain.getAvailableImage());
commandBuffer.end();

QueueSubmitInfo transferSubmit{};
transferSubmit.commandBuffer = commandBuffer;

// Wait for swapchain image to be available
// and rendering to be complete
transferSubmit.waitSemaphores = { renderSemaphores[currentFrame], imageAvailableSemaphore[currentFrame] };

// Signal another semaphore that swapchain
// is ready to be used
transferSubmit.signalSemaphores = { readyForPresenting[currentFrame] };

// Now, signal the fence since this is the end of frame
queueSubmit(transferSubmit, fences[currentFrame]);

// 3. Presenting (queue = Present)
PresentQueueSubmitInfo presentSubmit{};

// Wait until the swapchain is ready to be presented
// Basically, waits until the image is copied to swapchain
presentSubmit.waitSemaphores = { readyForPresenting[currentFrame] };

presentQueueSubmit(presentSubmit);

My understanding is that fences are needed to make the CPU wait until the GPU has finished executing the previously submitted command buffers.

When dealing with multiple queues, is it enough to make the CPU wait only for the frame and synchronize different queues with semaphores (pseudocode above is based on this)? Or should each queue wait for a fence separately?

To get into technical details, what will happen if two command buffers are submitted to the same queue without any semaphores? Pseudocode:

// first submissions
commandBufferOne.begin();
doSomething();
commandBufferOne.end();

SubmitInfo firstSubmit{};
firstSubmit.commandBuffer = commandBufferOne;
queueSubmit(firstSubmit, nullptr);

// second submission
commandBufferTwo.begin();
doSomethingElse();
commandBufferTwo.end();

SubmitInfo secondSubmit{};
secondSubmit.commandBuffer = commandBufferTwo;
queueSubmit(secondSubmit, nullptr);

Will the second submission overwrite the first one, or will the first be executed before the second since the queue is FIFO?


Solution

  • This entire organizational scheme seems dubious.

    Even ignoring the fact that the Vulkan specification does not require GPUs to offer separate queues for all of these things, you're spreading a series of operations across asynchronous execution, despite the fact that these operations are inherently sequential. You cannot copy from an image to the swapchain until the image has been rendered, and you cannot present the swapchain image until the copy has completed.

    So there is basically no advantage to putting these things into their own queues. Just do all of them on the same queue (with one submit and one vkQueuePresentKHR), using appropriate execution and memory dependencies between the operations. This means there's only one thing to wait on: the single submission.
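    A minimal sketch of that single-queue layout, in the same pseudocode style as the question (the `pipelineBarrier` call stands in for a real execution/memory dependency and is illustrative, not an actual Vulkan signature):

    ```
    // wait for the previous use of this frame's resources
    waitForFences(fences[currentFrame]);
    resetFences(fences[currentFrame]);

    swapchain.acquireNextImage(imageAvailableSemaphore[currentFrame]);

    commandBuffer.begin();
    renderEverything();                      // render into the offscreen texture
    pipelineBarrier(/* color-attachment write -> transfer read */);
    copyTexture(textureToPresent, swapchain.getAvailableImage());
    commandBuffer.end();

    QueueSubmitInfo submit{};
    submit.commandBuffer = commandBuffer;
    // only the swapchain image needs a semaphore; the render->copy
    // ordering is handled by the barrier inside the command buffer
    submit.waitSemaphores = { imageAvailableSemaphore[currentFrame] };
    submit.signalSemaphores = { readyForPresenting[currentFrame] };
    queueSubmit(submit, fences[currentFrame]);  // one fence for the whole frame

    PresentQueueSubmitInfo presentSubmit{};
    presentSubmit.waitSemaphores = { readyForPresenting[currentFrame] };
    presentQueueSubmit(presentSubmit);
    ```

    With everything in one submission, the CPU has exactly one fence per frame to wait on, and the intra-frame ordering lives entirely on the GPU timeline.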

    Plus, submit operations are really expensive; doing two submits instead of one submit containing both pieces of work is only a good thing if the submissions are being done on different CPU threads that can work concurrently. But binary semaphores stop that from working. You cannot submit a batch that waits for semaphore A until you have submitted a batch that signals semaphore A. This means that the signaling batch must either be earlier in the same submit command or must have been submitted by a prior submit command. So if you put those submits on different threads, you have to use a mutex or something to ensure that the signaling submit happens-before the waiting submit.1

    So you don't get any asynchronous execution of the queue submit operations; neither the CPU nor the GPU will execute any of this asynchronously.

    1: Timeline semaphores don't have this problem.
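    To illustrate the difference, here is a hypothetical sketch (same pseudocode style as the question) of what timeline semaphores allow: the waiting batch may be submitted before the signaling batch, from any thread, because the wait is on a counter value rather than a binary signal. In real Vulkan this corresponds to a semaphore created with `VK_SEMAPHORE_TYPE_TIMELINE`:

    ```
    TimelineSemaphore timeline;  // counter starts at 0

    // thread A: transfer submit, waits for the counter to reach 1;
    // legal to submit even though nothing has signaled 1 yet
    transferSubmit.waitSemaphores = { { timeline, /*value*/ 1 } };
    queueSubmit(transferSubmit, nullptr);

    // thread B: graphics submit, bumps the counter to 1 when done
    renderSubmit.signalSemaphores = { { timeline, /*value*/ 1 } };
    queueSubmit(renderSubmit, nullptr);
    ```

    Since submission order no longer matters, no mutex is needed between the two threads.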


    As for the particulars of your technical question: if operation A is dependent on operation B, and you synchronize with A, you have also synchronized with B. Since your transfer operation waits on a signal from the graphics queue, waiting on the transfer operation also waits on all graphics commands submitted before that signal.
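    Applied to the question's own pseudocode, this transitivity means the single per-frame fence on the transfer submit is sufficient; no separate fence on the graphics submit is needed:

    ```
    // the transfer submit waits on renderSemaphores[currentFrame],
    // so its fence cannot signal before the graphics work is done
    queueSubmit(transferSubmit, fences[currentFrame]);

    // later, on the CPU:
    waitForFences(fences[currentFrame]);
    // at this point both the transfer copy AND all graphics commands
    // submitted before renderSemaphores[currentFrame] was signaled
    // are guaranteed to have completed
    ```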