I had a discussion with a friend about two questions regarding the performance of the OpenGL Rendering Pipeline, and we would like to ask for help in determining who is right.
I argued that rendering cost scales linearly with the number of pixels involved, and therefore rendering a 4k scene should take four times as long as rendering a 1080p scene. Then we discovered this resolution-FPS comparison video [see 1], and the scaling does not seem to be linear. Could someone explain why this is the case?
I argued that rendering a 1080p scene and rendering every 1/4 pixel of a 4k scene should have the same performance, as in both cases the same number of pixels is drawn [see 2]. My friend argued that this is not the case, because calculations for adjacent pixels can be done with a single instruction. Is he right? And if so, could someone explain how this works in practice?
Illustration [see 2]: (image not reproduced here)
I argued that rendering cost scales linearly with the number of pixels involved, and therefore rendering a 4k scene should take four times as long as rendering a 1080p scene. Then we discovered this resolution-FPS comparison video [see 1], and the scaling does not seem to be linear. Could someone explain why this is the case?
Remember: rendering happens in a pipeline. And rendering can only happen at the speed of the slowest part of that pipeline. Which part that is depends entirely on what you're rendering.
If you're shoving 2M triangles per frame at a GPU, and the GPU can only process 60M triangles per second, the highest framerate you will ever see is 30 FPS. Your performance is bottlenecked on the vertex processing pipeline; the resolution you render at is irrelevant, because it doesn't change the number of triangles in the scene.
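To put numbers on that (the 60M figure is just a made-up throughput for illustration, not any real GPU's spec), here's the arithmetic as a tiny C++ sketch:

    #include <cstdio>

    // Back-of-the-envelope version of the numbers above. The throughput figure
    // is hypothetical; the point is only that it caps FPS independently of resolution.
    int main() {
        const double trianglesPerFrame  = 2'000'000.0;   // what the application submits each frame
        const double trianglesPerSecond = 60'000'000.0;  // hypothetical vertex-processing throughput
        const double fpsCap = trianglesPerSecond / trianglesPerFrame;
        std::printf("vertex-bound FPS cap: %.0f\n", fpsCap);  // prints 30, regardless of resolution
    }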
Similarly, if you're rendering 5 triangles per frame, it doesn't matter what your resolution is; your GPU can chew that up in microseconds and will be sitting around waiting for more. Your performance is bottlenecked on how much work you're sending it.
Performance only scales linearly with resolution if you're bottlenecked on the parts of the rendering pipeline that actually depend on resolution: rasterization, fragment processing, blending, etc. If those aren't your bottleneck, there's no guarantee that your performance will be affected by increasing the resolution.
And it should be noted that modern high-performance GPUs have to be given a lot of work before they become bottlenecked on the fragment pipeline.
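One way to picture this is a deliberately crude bottleneck model: treat the frame time as whichever stage takes longest, where only the fragment stage grows with pixel count. The numbers below are invented, and real GPUs overlap their stages in far messier ways, but it shows why 4x the pixels rarely means 4x the frame time:

    #include <algorithm>
    #include <cstdio>

    // Simplified bottleneck model (my own sketch, not how any driver accounts for time):
    // the frame rate is set by the slowest stage, and only fragment work scales with resolution.
    struct Workload {
        double vertexTimeMs;        // vertex processing time, resolution-independent
        double fragmentTimeMsPerMp; // fragment time per megapixel shaded
    };

    double frameTimeMs(const Workload& w, double megapixels) {
        const double fragmentTimeMs = w.fragmentTimeMsPerMp * megapixels;
        return std::max(w.vertexTimeMs, fragmentTimeMs);  // slowest stage wins
    }

    int main() {
        const Workload w{ 8.0, 2.0 };               // hypothetical numbers
        const double t1080 = frameTimeMs(w, 2.07);  // 1920x1080 is about 2.07 MP
        const double t4k   = frameTimeMs(w, 8.29);  // 3840x2160 is about 8.29 MP
        std::printf("1080p: %.1f ms, 4k: %.1f ms (ratio %.2fx, not 4x)\n",
                    t1080, t4k, t4k / t1080);
    }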
I argued that rendering a 1080p scene and rendering every 1/4 pixel of a 4k scene should have the same performance, as in both cases the same number of pixels is drawn [see 2]. My friend argued that this is not the case, because calculations for adjacent pixels can be done with a single instruction. Is he right?
That depends entirely on how you manage to cause the system to "render every 1/4 pixel in a 4k scene". Rasterizers generally don't go around skipping pixels. So how do you intend to make the GPU pull off this feat? With a stencil buffer?
Personally, I can't imagine a way to pull this off without breaking SIMD, but I won't say it's impossible.
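For concreteness, here's roughly what a stencil-based attempt might look like against the OpenGL API; fillStencilMask() and drawScene() are placeholders I made up for the sketch, and the comments spell out why the masked-out pixels don't come for free:

    #include <GL/gl.h>

    // Hypothetical helper: some earlier pass would write stencil value 1 at one
    // pixel of every 2x2 block (e.g. via a full-screen draw). Stubbed out here.
    void fillStencilMask() { /* ... */ }

    void drawSceneThroughMask() {
        glClear(GL_STENCIL_BUFFER_BIT);
        fillStencilMask();

        glEnable(GL_STENCIL_TEST);
        glStencilFunc(GL_EQUAL, 1, 0xFF);        // shade only where stencil == 1
        glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP);  // leave the mask untouched

        // drawScene();  // the normal scene rendering would go here
        //
        // Even with 3 out of every 4 pixels stencilled out, the rasterizer still
        // works in 2x2 quads, so the surviving pixel in each quad drags the other
        // SIMD lanes along as wasted work, which is why this is unlikely to run
        // as fast as a plain 1080p render.

        glDisable(GL_STENCIL_TEST);
    }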
And if so, could someone explain how this works in practice?
You're talking about the very essence of Single-Instruction, Multiple Data (SIMD).
When you render a triangle, you execute a fragment shader on every fragment generated by the rasterizer. But you're executing the same fragment shader program on each of them. Each FS that operates on a fragment uses the same source code. They have the same "Single-Instructions".
The only difference between them is really the data they start with. Each fragment contains the interpolated per-vertex values provided by vertex processing. So they have "Multiple" sets of "Data".
So if they're all going to be executing the same instructions over different initial values... why bother executing them separately? Just execute them using SIMD techniques. Each opcode is executed on different sets of data. So you only have one hardware "execution unit", but that unit can process 4 (or more) fragments at once.
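If it helps, here's a toy CPU-side picture of that idea (my own sketch, not how any driver or GPU is actually implemented): one instruction stream stepping a group of four fragments in lockstep, the way a 2x2 quad is shaded.

    #include <array>
    #include <cstdio>

    // Toy illustration of SIMD fragment shading: one set of instructions applied
    // in lockstep to a group of 4 fragments. Real GPUs do this in hardware,
    // typically on groups of 32 or 64 fragments.
    constexpr int kLanes = 4;

    struct FragmentGroup {                 // "Multiple Data": one value per lane
        std::array<float, kLanes> u, v;    // interpolated inputs from the rasterizer
        std::array<float, kLanes> r, g, b; // shaded outputs
    };

    // "Single Instruction": the same shader logic runs on every lane of the group.
    void shadeGroup(FragmentGroup& f) {
        for (int lane = 0; lane < kLanes; ++lane) {  // hardware runs these lanes in parallel
            f.r[lane] = f.u[lane];                   // a trivially simple "shader"
            f.g[lane] = f.v[lane];
            f.b[lane] = 0.5f * (f.u[lane] + f.v[lane]);
        }
    }

    int main() {
        FragmentGroup quad{ {0.0f, 0.1f, 0.0f, 0.1f}, {0.0f, 0.0f, 0.1f, 0.1f} };
        shadeGroup(quad);
        for (int lane = 0; lane < kLanes; ++lane)
            std::printf("fragment %d -> (%.2f, %.2f, %.2f)\n",
                        lane, quad.r[lane], quad.g[lane], quad.b[lane]);
    }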
This execution model is basically why GPUs work.