I am rendering a grass field. Each grass blade's xz-position is stored in an instanced vertex buffer.
I was generating these positions using a Halton sequence, so the positions within this buffer were effectively "random". Later on, I decided to generate them in chunks laid out as Z-curves to improve data locality and to prepare for frustum culling. Even without frustum culling implemented yet, performance improved, but only while looking in one specific direction (looking in the opposite direction actually got worse).
To make sure I wasn't going insane, I went back to the old branch, which used the Halton sequence to generate points, and sorted them. These are the results:
Sort by | Look at -x | Look at x | Look at -z | Look at z |
---|---|---|---|---|
none | ~25 FPS | ~25 FPS | ~25 FPS | ~25 FPS |
a.x < b.x | ~15 FPS | ~30 FPS | ~20 FPS | ~20 FPS |
a.x > b.x | ~30 FPS | ~15 FPS | ~20 FPS | ~20 FPS |
a.z < b.z | ~20 FPS | ~20 FPS | ~15 FPS | ~30 FPS |
a.z > b.z | ~20 FPS | ~20 FPS | ~30 FPS | ~15 FPS |
This is running on Intel graphics (mesa). It might also be worth mentioning that I'm using Vulkan Memory Allocator.
Going back to the Z-curve implementation and adding frustum culling such that only the positions of the blades within the frustum are written to the buffer changed nothing. Furthermore, shuffling the blades "fixed" the issue, going back to a stable (but lower) FPS.
I also want to make clear that I'm not doing any conditional work based on the value of the view matrix (unless frustum culling is enabled), just the usual matrix/vector multiplication.
What could be causing all this?
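This is a standard quirk of early depth testing (and of related optimizations such as hierarchical Z). If you draw front-to-back, so that the grass closest to the camera is drawn first, then later blades that fall behind it fail the depth test before they are shaded and can be thrown away early. Drawing back-to-front defeats this: every blade passes the depth test at the moment it is drawn, so all of them get shaded. That is why sorting along one axis speeds up one view direction and slows down the opposite one.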
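To make the effect concrete, here is a toy CPU model of a less-than depth test. It is only a sketch, not how any real GPU is implemented: the function name `count_shaded_fragments` is made up for illustration, and it assumes every blade covers every pixel of a tiny framebuffer at a single depth.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Toy model (assumption): each blade covers all pixels at one constant depth.
// Returns how many fragments pass the depth test and would therefore be shaded.
std::size_t count_shaded_fragments(const std::vector<float>& blade_depths,
                                   std::size_t pixel_count) {
    assert(pixel_count > 0);
    // Depth buffer cleared to "infinitely far away".
    std::vector<float> depth_buffer(pixel_count,
                                    std::numeric_limits<float>::infinity());
    std::size_t shaded = 0;
    for (float depth : blade_depths) {      // blades in submission order
        for (float& stored : depth_buffer) {
            if (depth < stored) {           // depth test runs before shading
                stored = depth;
                ++shaded;                   // only surviving fragments are shaded
            }
        }
    }
    return shaded;
}
```

With four blades at depths 1, 2, 3, 4 over 8 pixels, submitting them nearest-first shades 8 fragments, while submitting them farthest-first shades all 32.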
Most GPUs also have some form of hidden-surface removal that doesn't rely on front-to-back rendering, but these schemes all have their quirks and will be less reliable than early depth tests.
I would suggest coarsely sorting grass batches rather than individual blades, since the sort won't be free on the CPU. However, it's unlikely you would need to fully re-sort every frame unless your camera is jumping around, which helps.
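A minimal sketch of what that could look like, assuming each chunk's blades sit contiguously in the instance buffer. The `GrassBatch` layout, the function names, and the movement threshold are all hypothetical, and distance to the chunk centre is used as a cheap stand-in for depth along the view direction.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Vec2 { float x, z; };

// Hypothetical batch record: one chunk of blades, contiguous in the instance buffer.
struct GrassBatch {
    Vec2 center;              // chunk centre on the xz-plane
    unsigned first_instance;  // offset of the chunk's first blade
    unsigned instance_count;  // number of blades in the chunk
};

inline float distance_squared(Vec2 a, Vec2 b) {
    const float dx = a.x - b.x;
    const float dz = a.z - b.z;
    return dx * dx + dz * dz;
}

// Order batches nearest-first so that closer chunks are submitted first.
void sort_front_to_back(std::vector<GrassBatch>& batches, Vec2 camera) {
    std::sort(batches.begin(), batches.end(),
              [camera](const GrassBatch& a, const GrassBatch& b) {
                  return distance_squared(a.center, camera) <
                         distance_squared(b.center, camera);
              });
}

// Re-sort only once the camera has moved further than `threshold`
// from where it was at the last sort.
bool needs_resort(Vec2 camera, Vec2 camera_at_last_sort, float threshold) {
    assert(threshold >= 0.0f);
    return distance_squared(camera, camera_at_last_sort) > threshold * threshold;
}
```

After sorting, one option is to issue one draw per batch in that order, passing `instance_count` and `first_instance` as the `instanceCount` and `firstInstance` arguments of `vkCmdDraw`. Blades inside a chunk keep whatever order they already have, so the Z-curve locality is preserved.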