Vulkan Z pre-pass and MSAA

I recently implemented a Z pre-pass in Vulkan (a single render pass with two subpasses) to reduce overdraw in a scene with thousands of grass blades. Before that, I was sorting the grass blades by distance to the camera. The Z pre-pass is much better at reducing overdraw and improves performance significantly, but only when MSAA is disabled. With MSAA enabled, the Z pre-pass performs significantly worse, which I wasn't expecting since I'm using the exact same number of attachments. My guess is that MSAA is being performed twice on the depth buffer (once in each subpass).

I was thinking of using VkSubpassDescriptionDepthStencilResolve to perform MSAA on the depth buffer in the first subpass only, but a depth image is not allowed to have fewer samples than the color image.

Is MSAA for the depth buffer being performed twice? If so, is there something that can be done?

It might be worth mentioning that I'm rendering with a desktop GPU.

Thank you.

Solution

When you're doing multisampled rendering, you are rendering to buffers that, for each pixel, contain multiple samples. The point of multisampled rendering is that the rasterizer may generate fewer fragments per pixel than the sample count. The fragment shader's color output is simply copied to every covered sample within that pixel.

When it comes to depth, however, unless your shader modifies the depth value, the depth test is performed per-sample. That's kinda the point of it. So while you save fragment shader execution time, you're not saving time relative to the depth buffer. Per-sample, you have a full read-modify-write for the depth buffer.
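To make the distinction concrete, here is a rough CPU-side sketch of what happens to a single pixel under 4x MSAA. It is a toy model, not how any real GPU is implemented: the `Pixel`, `Fragment`, and `Counters` types and the `rasterizeFragment` function are all made up for illustration, and it assumes the depth test runs before shading with a plain "less than" compare.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kSamples = 4;  // assumed sample count (4x MSAA)

// One framebuffer pixel: every sample has its own depth and color.
struct Pixel {
    std::array<float, kSamples> depth;
    std::array<std::uint32_t, kSamples> color;
};

// Hypothetical fragment produced by the rasterizer for one pixel.
struct Fragment {
    std::uint32_t coverageMask;               // bit s set = sample s is covered
    std::array<float, kSamples> sampleDepth;  // depth interpolated per sample
    std::uint32_t shadedColor;                // result of ONE shader invocation
};

struct Counters {
    int depthTests = 0;
    int shaderInvocations = 0;
};

void rasterizeFragment(Pixel& px, const Fragment& frag, Counters& counters) {
    // Only the low kSamples bits of the mask are meaningful.
    assert((frag.coverageMask >> kSamples) == 0);

    std::uint32_t passed = 0;
    for (int s = 0; s < kSamples; ++s) {
        if ((frag.coverageMask & (1u << s)) == 0) continue;
        ++counters.depthTests;                 // one read-modify-write per sample
        if (frag.sampleDepth[s] < px.depth[s]) {
            px.depth[s] = frag.sampleDepth[s];
            passed |= 1u << s;
        }
    }
    if (passed == 0) return;                   // every sample failed: no shading

    ++counters.shaderInvocations;              // shaded once per pixel...
    for (int s = 0; s < kSamples; ++s) {
        if (passed & (1u << s)) px.color[s] = frag.shadedColor;  // ...copied per sample
    }
}
```

A fully covered fragment costs four depth tests but at most one shader invocation, and a fully occluded one still costs four depth tests with no shading at all.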

Therefore, as you increase the sample count, the cost of depth tests using multisampling increases linearly, even though fragment shader execution costs may not. 4x MSAA costs 4x as many depth tests.
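As a back-of-the-envelope illustration of that scaling, the sketch below counts depth tests and shader invocations for a given number of rasterized fragments. The `estimateWork` function and `WorkEstimate` struct are hypothetical names, and the model assumes every fragment fully covers its pixel and is shaded once, so treat the numbers as an upper bound rather than a measurement.

```cpp
#include <cassert>
#include <cstdint>

struct WorkEstimate {
    std::uint64_t depthTests;         // scales with the sample count
    std::uint64_t shaderInvocations;  // does not (one per fragment in this model)
};

// fragments: pixel-sized fragments the rasterizer emits, overdraw included.
// sampleCount: MSAA samples per pixel (1 means MSAA is off).
WorkEstimate estimateWork(std::uint64_t fragments, std::uint32_t sampleCount) {
    assert(sampleCount >= 1);
    return {fragments * sampleCount, fragments};
}
```

For example, `estimateWork(1000, 1)` gives 1000 depth tests, while `estimateWork(1000, 4)` gives 4000 depth tests for the same 1000 shader invocations.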

The job of depth pre-pass is to ensure that fragment shaders are only executed when absolutely necessary, when that sample will certainly be visible. That's great, but all those failed depth tests aren't free. If you're doing 4x as many depth tests, you have 4x as many failures.

So that's likely what's happening here: you reduced the cost of fragment shaders so much that the cost of the depth test (and associated framebuffer bandwidth) is dominating your performance.
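The trade-off can be sketched with a deliberately crude cost model. Everything here is an assumption for illustration: the `SceneModel` fields, the two functions, and especially the cost weights are made up, and the no-pre-pass case assumes the worst case where every layer of overdraw gets shaded. It only shows how the balance can flip once depth work is multiplied by the sample count; it says nothing about your actual hardware.

```cpp
#include <cassert>
#include <cstdint>

struct SceneModel {
    std::uint64_t visiblePixels;  // screen pixels covered by geometry
    std::uint32_t overdraw;       // average layers of geometry per pixel
    std::uint32_t sampleCount;    // MSAA samples per pixel
    std::uint64_t shaderCost;     // made-up cost units per shader invocation
    std::uint64_t depthTestCost;  // made-up cost units per per-sample depth test
};

// Single pass, worst case: every layer is depth tested and shaded.
std::uint64_t costWithoutPrepass(const SceneModel& m) {
    assert(m.sampleCount >= 1 && m.overdraw >= 1);
    const std::uint64_t fragments = m.visiblePixels * m.overdraw;
    return fragments * m.sampleCount * m.depthTestCost + fragments * m.shaderCost;
}

// Pre-pass: geometry is rasterized and depth tested twice (per sample both
// times), but only the visible layer runs the fragment shader.
std::uint64_t costWithPrepass(const SceneModel& m) {
    assert(m.sampleCount >= 1 && m.overdraw >= 1);
    const std::uint64_t fragments = m.visiblePixels * m.overdraw;
    return 2 * fragments * m.sampleCount * m.depthTestCost
         + m.visiblePixels * m.shaderCost;
}
```

With 1000 visible pixels, 8 layers of overdraw, a shader cost of 4 and a depth test cost of 1, the pre-pass wins without MSAA (20000 vs 40000 units) but loses at 4x MSAA (68000 vs 64000 units), because the doubled per-sample depth work outweighs the shading it saves.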

You can't render with multisampled color buffers and non-multisampled depth buffers.