opengl, lwjgl

Deferred Rendering vs. Pre-Fragment-Shader Depth Culling


The purported advantage of deferred rendering is that the lighting fragment shader only has to be executed numLights * numPixels times, whereas with forward rendering the same pixel is often shaded more than once because of overdraw. But what if depth culling (z-culling) were performed after the geometry shader and before the fragment shader? That would make forward rendering just as efficient as deferred rendering, without the need for a large G-buffer. I believe this is possible in newer versions of OpenGL, so why does nobody do it? Please point out any error in my reasoning.

Note: perhaps this isn't a proper question, but I am unaware of where else I could post this.


Solution

  • But what if depth culling (z-culling) were performed after the geometry shader and before the fragment shader? That would make forward rendering just as efficient as deferred rendering

    No, it would not.

    You would still be executing all of the vertex processing stages once per light, because standard forward rendering requires drawing each model once for each light. In deferred rendering, each model is drawn only once, so you save that vertex processing time.
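
    As a rough illustration of that difference, here is a sketch of the two draw loops in Java/LWJGL style. The Model and Light types and the helper methods are hypothetical placeholders, not a real API; only the loop structure matters:

        // Hypothetical sketch: Model, Light, and the helpers below are placeholders,
        // not a real engine API. The point is where the geometry gets re-drawn.
        import java.util.List;

        class DrawLoops {
            interface Model { void draw(); }   // issues one draw call for the mesh
            interface Light { /* position, colour, ... */ }

            // Classic multi-pass forward rendering: the whole scene's vertex work
            // is repeated once per light.
            static void forward(List<Model> models, List<Light> lights) {
                for (Light light : lights) {
                    setLightUniforms(light);       // hypothetical helper
                    for (Model model : models) {
                        model.draw();              // VS/GS run again for every light
                    }
                }
            }

            // Deferred rendering: geometry is processed exactly once; lighting is a
            // per-light screen-space pass that reads the G-buffer.
            static void deferred(List<Model> models, List<Light> lights) {
                bindGBuffer();                     // hypothetical helper
                for (Model model : models) {
                    model.draw();                  // VS/GS run once per model
                }
                bindDefaultFramebuffer();          // hypothetical helper
                for (Light light : lights) {
                    setLightUniforms(light);
                    drawFullScreenQuad();          // FS reads the G-buffer
                }
            }

            static void setLightUniforms(Light l) {}
            static void bindGBuffer() {}
            static void bindDefaultFramebuffer() {}
            static void drawFullScreenQuad() {}
        }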

    That's important because a lot of modern hardware uses a so-called "unified shader architecture". This means the hardware can assign shader workloads dynamically. If you render a lot of vertices with a short fragment shader, it can assign more shader units to the VS stages. If you render 4 vertices with a simple VS but a complex FS, it can assign less hardware to the vertex stage and more to the fragment shaders.

    So in deferred rendering, you're spending more of the available shader processor time doing useful work. With forward rendering, you're splitting the GPU's resources between the useful FS work and the repeated, useless VS work.

    Next, you might suggest using transform feedback to capture the results of vertex processing and re-render them for each light. But now you're just trading one bottleneck for another: you use more memory, and you exchange VS work for memory bandwidth.


    I perform a draw call only once so there are no redundant geometries, and I instead loop through all of the lights in the fragment shader using an array of lights as a uniform.

    That simply trades one inefficiency for another, though it is generally faster (if less flexible). Instead of rendering each mesh a bunch of times, you're running very expensive FS invocations for fragments that aren't necessarily visible.
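
    For reference, the single-pass approach described above typically boils down to a fragment shader along these lines, shown here as a GLSL string the way LWJGL code usually embeds shaders. This is a minimal sketch with made-up uniform and varying names, not code from the question:

        // Minimal sketch of a "loop over a light array" fragment shader, stored as a
        // Java string constant. All names are illustrative.
        final class SinglePassForwardShader {
            static final String FRAGMENT_SOURCE =
                "#version 330 core\n" +
                "#define MAX_LIGHTS 32\n" +
                "struct Light { vec3 position; vec3 color; };\n" +
                "uniform Light lights[MAX_LIGHTS];\n" +
                "uniform int lightCount;\n" +
                "in vec3 fragPos;\n" +
                "in vec3 fragNormal;\n" +
                "out vec4 outColor;\n" +
                "void main() {\n" +
                "    vec3 n = normalize(fragNormal);\n" +
                "    vec3 result = vec3(0.0);\n" +
                "    for (int i = 0; i < lightCount; ++i) {\n" +
                "        vec3 l = normalize(lights[i].position - fragPos);\n" +
                "        result += max(dot(n, l), 0.0) * lights[i].color;\n" +
                "    }\n" +
                "    outColor = vec4(result, 1.0);\n" +
                "}\n";

            private SinglePassForwardShader() {}
        }

    Every iteration of that loop runs even for fragments that later turn out to be hidden behind something else, which is exactly the wasted work being described.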

    Early depth tests don't guarantee anything. It is impossible for the hardware to guarantee that every FS invocation executed will be visible, since the GPU may not have rasterized the occluding object yet. Early depth tests don't determine the order things get drawn in.

    The only way to guarantee that every FS invocation is visible is to perform a depth pre-pass. That is, you first render everything in the scene without a fragment shader (or with color writes disabled), so the only thing that gets updated is the depth buffer. Then you render the scene again with your actual shaders, with the depth test set to GL_LEQUAL or GL_EQUAL, so that only the front-most fragment at each pixel gets shaded.
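
    In OpenGL terms (via LWJGL's GL11 bindings here), the pre-pass is mostly a matter of depth-function and write-mask state. This is a minimal sketch that assumes a current OpenGL context; drawSceneDepthOnly() and drawSceneShaded() are hypothetical placeholders for your own draw calls:

        // Depth pre-pass sketch using LWJGL's GL11 bindings. Assumes a current OpenGL
        // context; drawSceneDepthOnly() and drawSceneShaded() are placeholders.
        import static org.lwjgl.opengl.GL11.*;

        final class DepthPrePass {
            static void render() {
                glEnable(GL_DEPTH_TEST);
                glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

                // Pass 1: lay down depth only. No colour writes, trivial fragment work.
                glDepthFunc(GL_LESS);
                glDepthMask(true);
                glColorMask(false, false, false, false);
                drawSceneDepthOnly();

                // Pass 2: shade with the real (expensive) fragment shader. Only fragments
                // matching the depth from pass 1 survive, so each pixel is shaded at most once.
                glDepthFunc(GL_LEQUAL);   // or GL_EQUAL
                glDepthMask(false);       // depth buffer is already correct
                glColorMask(true, true, true, true);
                drawSceneShaded();
            }

            static void drawSceneDepthOnly() {}
            static void drawSceneShaded() {}
        }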

    But then, you're rendering the entire scene twice. And shader load balancing is probably much harder, since entire objects are being culled.

    The other thing you have to recognize is that deferred rendering is no guarantee of performance. It is simply another method of rendering, with its own benefits and drawbacks. You can create circumstances where deferred rendering is slower than any variation of forward rendering, and circumstances where it is faster. While deferred rendering is generally the best bet as the number of lights increases, there are always circumstances that can make another rendering technique superior in performance.