Tags: memory-management, buffer, vulkan

How should a staging buffer be used properly, performance-wise?


How should the Vulkan API's staging buffer be used properly? Saving your vertex data into the staging buffer and then copying it to the vertex buffer on the GPU seems to take longer than just submitting your vertices directly to your vertex buffer. This is a Minecraft-clone program, so there will be a lot of vertex data (with index data too) and dynamic chunk loading; is there any other kind of buffer, or method of buffering, that this use case might benefit from?

Even using a separate thread per device, or cross-device threading, seems to be slower than just submitting vertices directly to the vertex buffer on the fly. And I do not yet clearly understand the pros and cons of the traditional direct vertex buffer versus the staging buffer.

The tutorial I'm currently following uses a staging buffer once, before drawing and presentation. There seems to be a lack of forums or articles discussing precisely the problem described above.


Solution

  • The exact mechanics one would use to achieve high performance would depend heavily on both the details of the hardware and the expected frequency of data updates.

    Staging buffers are only relevant for GPUs that have multiple pools of device memory, called heaps. Integrated GPUs typically only have one heap, so there's no point in staging vertex data (textures still need staging because of tiling).
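    As a concrete sketch of that check (with a mocked properties struct, since the real values come from `vkGetPhysicalDeviceMemoryProperties()` in `VkPhysicalDeviceMemoryProperties::memoryHeapCount`):

    ```c
    #include <stdint.h>

    /* Mock of the heap count that vkGetPhysicalDeviceMemoryProperties()
       would report; the struct is faked here so the logic can be shown
       standalone, without a VkPhysicalDevice. */
    typedef struct {
        uint32_t memoryHeapCount;
    } MockMemoryProperties;

    /* Staging vertex data only buys anything when there is a second heap
       to stage *into*; on a typical integrated GPU this returns 0. */
    int staging_worthwhile(const MockMemoryProperties *props) {
        return props->memoryHeapCount > 1;
    }
    ```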

    So in a device with more than one heap, the first thing you need to find out is what your options are. In multi-memory GPUs (aka: GPUs that have their own memory), one or more of the heaps will be marked DEVICE_LOCAL. This is meant to represent memory which has the fastest possible access time for GPU operations.

    But device-local memory is (usually) not directly accessible from the CPU; hence the need for staging.

    However, memory that isn't device-local may still be usable for GPU tasks. That is, a GPU may be able to read directly from CPU-accessible memory for certain kinds of operations. You can ask whether a particular memory type can be used as the source memory for vertex data.

    If CPU-accessible, non-device-local memory can be used for vertex data, then you now have a real choice: which heap to read from? Do you read vertex data across the PCI-e bus? Or do you transfer the vertex data to faster memory, then read from it?
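    A minimal sketch of that query, using the standard memory-type selection loop: in real code `type_bits` comes from `vkGetBufferMemoryRequirements()` and `type_flags[]` from `vkGetPhysicalDeviceMemoryProperties()`; the flag constants below match the Vulkan spec values, but the arrays here are stand-ins for the real structs.

    ```c
    #include <stdint.h>

    #define DEVICE_LOCAL_BIT  0x1u  /* VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT  */
    #define HOST_VISIBLE_BIT  0x2u  /* VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT  */
    #define HOST_COHERENT_BIT 0x4u  /* VK_MEMORY_PROPERTY_HOST_COHERENT_BIT */

    /* Returns the first memory type index that the buffer can use
       (per type_bits) and that has all `required` property flags,
       or -1 if no such type exists. */
    int find_memory_type(uint32_t type_bits, uint32_t required,
                         const uint32_t *type_flags, uint32_t type_count) {
        for (uint32_t i = 0; i < type_count; ++i) {
            if ((type_bits & (1u << i)) &&
                (type_flags[i] & required) == required)
                return (int)i;
        }
        return -1;
    }
    ```

    If `find_memory_type(type_bits, HOST_VISIBLE_BIT | HOST_COHERENT_BIT, ...)` succeeds for a buffer created with vertex-buffer usage, reading vertices straight from that CPU-accessible type is the non-staged option; asking for `DEVICE_LOCAL_BIT` instead gives you the staged one.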

    In cases where vertex data is being dynamically generated every frame, I would probably say that you should default to not staging (but you should still profile it on applicable hardware).

    And if you were doing some kind of data streaming, where you're loading world data as the camera moves through a scene, then I would say that the best thing to do would be to transfer it to device-local memory. You're using the data much more frequently than you're doing transfer operations, and you should be able to dump whole chunks of data in a few transfer calls.
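    The "few transfer calls" point can be sketched as packing chunk meshes back to back in one staging buffer and building one `VkBufferCopy`-style region per chunk, so the whole batch goes through a single `vkCmdCopyBuffer()`. The struct mirrors `VkBufferCopy`, but the helper and parameter names here are made up for illustration:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uint64_t srcOffset;  /* offset into the staging buffer      */
        uint64_t dstOffset;  /* offset into the device-local buffer */
        uint64_t size;
    } CopyRegion;

    /* Lays out `count` chunks contiguously in staging memory, writing one
       copy region per chunk; returns the total staging bytes needed. */
    uint64_t pack_chunk_copies(const uint64_t *chunk_sizes,
                               const uint64_t *chunk_dst_offsets,
                               size_t count, CopyRegion *out) {
        uint64_t staging_offset = 0;
        for (size_t i = 0; i < count; ++i) {
            out[i].srcOffset = staging_offset;
            out[i].dstOffset = chunk_dst_offsets[i];
            out[i].size      = chunk_sizes[i];
            staging_offset  += chunk_sizes[i];
        }
        return staging_offset;
    }
    ```

    After memcpy-ing each chunk's vertices into the mapped staging memory at its `srcOffset`, one `vkCmdCopyBuffer(cmd, staging, vertexBuffer, regionCount, regions)` moves the whole batch in a single command.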

    But your use case is about intermittent data generation. Even if a player is constantly placing blocks, you're likely to be reusing the same data to render several frames. And even when a block is placed, you're only changing that one subsection of data. Each transfer operation has some degree of overhead to it, so doing a bunch of small transfers can make things sluggish.
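    One common way to blunt that per-transfer overhead (a sketch, assuming you track dirty regions of the vertex buffer as byte ranges sorted by offset) is to coalesce adjacent or overlapping ranges before issuing transfers, so several nearby block edits become one copy region:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint64_t offset, size; } DirtyRange;

    /* Merges overlapping or touching ranges in place; `ranges` must be
       sorted by offset.  Returns the number of ranges that remain. */
    size_t coalesce_ranges(DirtyRange *ranges, size_t count) {
        if (count == 0) return 0;
        size_t out = 0;
        for (size_t i = 1; i < count; ++i) {
            uint64_t end = ranges[out].offset + ranges[out].size;
            if (ranges[i].offset <= end) {
                uint64_t new_end = ranges[i].offset + ranges[i].size;
                if (new_end > end)
                    ranges[out].size = new_end - ranges[out].offset;
            } else {
                ranges[++out] = ranges[i];
            }
        }
        return out + 1;
    }
    ```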

    As such, it's hard to say which is better; you should profile it on a variety of hardware to see what the performance is like.

    Also, be advised that some discrete GPUs will have a special heap of around 256MB. It's special because it is both CPU-accessible and device-local. Presumably, there is some fast channel for the CPU to write its data to this device memory. This memory heap is designed for streaming usage, so it would be pretty good for your needs (assuming its size is adequate; the size tends to be around 256MB regardless of the GPU's total memory).
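    A sketch of preferring that heap when placing streamed vertex data: try a type that is both device-local and host-visible first, and fall back to plain host-visible memory when the GPU has no such heap. The flag values match `VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT` and `VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT`; the `type_flags` array is a stand-in for the real `VkMemoryType` list.

    ```c
    #include <stdint.h>

    #define DEVICE_LOCAL_BIT 0x1u  /* VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT */
    #define HOST_VISIBLE_BIT 0x2u  /* VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT */

    int pick_streaming_type(const uint32_t *type_flags, uint32_t type_count) {
        /* First pass: both device-local and host-visible (the BAR heap). */
        for (uint32_t i = 0; i < type_count; ++i)
            if ((type_flags[i] & (DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT))
                    == (DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT))
                return (int)i;
        /* Fallback: any host-visible type. */
        for (uint32_t i = 0; i < type_count; ++i)
            if (type_flags[i] & HOST_VISIBLE_BIT)
                return (int)i;
        return -1;
    }
    ```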