How to measure execution time of Vulkan pipeline

Summary

I wish to be able to measure time elapsed in milliseconds, on the GPU, of running the entire graphics pipeline. The goal: To be able to save benchmarks before/after optimizing the code (next step would be mipmapping textures) to see improvements. This was really simple in OpenGL, but I'm new to Vulkan, and could use some help.

I have browsed related existing answers (here and here), but they aren't really of much help. And I cannot find code samples anywhere, so I dare ask here.

Through documentation pages I have found a couple of functions that I think I should be using, so I have in place something like this:

1: Creating query pool

void CreateQueryPool()
{
    VkQueryPoolCreateInfo createInfo{};
    createInfo.sType = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO;
    createInfo.pNext = nullptr; // Optional
    createInfo.flags = 0; // Reserved for future use, must be 0!

    createInfo.queryType = VK_QUERY_TYPE_TIMESTAMP;
    createInfo.queryCount = mCommandBuffers.size() * 2; // REVIEW

    VkResult result = vkCreateQueryPool(mDevice, &createInfo, nullptr, &mTimeQueryPool);
    if (result != VK_SUCCESS)
    {
        throw std::runtime_error("Failed to create time query pool!");
    }
}

I had the idea of queryCount = mCommandBuffers.size() * 2 to have space for a separate query timestamp before and after rendering, but I have no clue whether this assumption is correct or not.

2: Recording command buffers

// recording command buffer i:
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, mTimeQueryPool, i);
// render pass ...
vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, mTimeQueryPool, i);

vkCmdCopyQueryPoolResults(/* many parameters here */);

I'm looking for a couple of clarifications:

What is the concequence of writing to the same query index? Do I need two separate query pools - one for before render time and one for after render time?
How should I handle synchronization? I assume having a separate query for each command buffer.
For the destination buffer containing the query result, is it good enough to store somewhere with "host visible bit", or do I need staging memory for "device visible only"? I'm a bit lost on this one as well.

I have not been able to find any online examples of how to measure render time, but I just assume it's such a common task that surely there must be an example out there somewhere.

Solution

So, thanks to @karlschultz, I managed to get something working. So in case other people will be looking for the same answer, I decided to post my findings here. For the Vulkan experts out there: Please let me know if I make obvious mistakes, and I will correct them here!

Query Pool Creation

I fill out a VkQueryPoolCreateInfo struct as described in my question, and let its queryCount field equal twice the number of command buffers, to store space for a query before and after rendering.

Important here is to reset all entries in the query pool before using the queries, and to reset a query after writing to it. This necessitates a few changes:

1) Asking graphics queue if timestamps are supported

When picking the graphics queue family, the struct VkQueueFamilyProperties has a field timestampValidBits which must be greater than 0, otherwise the queue family cannot be used for timestamp queries!

2) Determining the timestamp period

The physical device contains a special value which indicates the number of nanoseconds it takes for a timestamp query to be incremented by 1. This is necessary to interpret the query result as e.g. nanoseconds or milliseconds. That value is a float, and can be retrieved by calling vkGetPhysicalDeviceProperties and looking at the field VkPhysicalDeviceProperties.limits.timestampPeriod.

3) Asking for query reset support

During logical device creation, one must fill out a struct and add it to the pNext chain to enable the host query reset feature:

VkDeviceCreateInfo createInfo{};
VkPhysicalDeviceHostQueryResetFeatures resetFeatures;
resetFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_HOST_QUERY_RESET_FEATURES;
resetFeatures.pNext = nullptr;
resetFeatures.hostQueryReset = VK_TRUE;

createInfo.pNext = &resetFeatures;

4) Recording command buffers

Timestamp queries should be outside the scope of the render pass, as seen below. It is not possible to measure running time of a single shader (e.g. fragment shader), only the entire pipeline or whatever is outside the scope of the render pass, due to (potential) temporal overlap of pipeline stages.

vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, mTimeQueryPool, i * 2);

vkCmdBeginRenderPass(/* ... */);

// render here...

vkCmdEndRenderPass(mCommandBuffers[i]);

vkCmdWriteTimestamp(mCommandBuffers[i], VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, mTimeQueryPool, i * 2 + 1);

5) Retrieving query result

We have two methods for this: vkCmdCopyQueryPoolResults and vkGetQueryPoolResults. I chose to go with the latter since is greatly simplifies the setup and does not require synchronization with GPU buffers.

Given that I have a swapchain index (in my scenario same is command buffer index!), I have a setup like this:

void FetchRenderTimeResults(uint32_t swapchainIndex)
{
    uint64_t buffer[2];

    VkResult result = vkGetQueryPoolResults(mDevice, mTimeQueryPool, swapchainIndex * 2, 2, sizeof(uint64_t) * 2, buffer, sizeof(uint64_t),
    VK_QUERY_RESULT_64_BIT);
    if (result == VK_NOT_READY)
    {
        return;
    }
    else if (result == VK_SUCCESS)
    {
        mTimeQueryResults[swapchainIndex] = buffer[1] - buffer[0];
    }
    else
    {
        throw std::runtime_error("Failed to receive query results!");
    }

    // Queries must be reset after each individual use.
    vkResetQueryPool(mDevice, mTimeQueryPool, swapchainIndex * 2, 2);
}

The variable mTimeQueryResults refers to an std::vector<uint64_t> which contains a result for each swapchain. I use it to calculate an average rendering time each second by using the timestamp period determined in step 2).

And one must not forget to cleanup to query pool by calling vkDestroyQueryPool.

There are a lot of details omitted, and for a total Vulkan noob like me this setup was frightening and took several days to figure out. Hopefully this will spare someone else the headache.