Tags: performance, opengl, glsl, textures

How to improve texture access performance in OpenGL shaders?


Conditions

I use OpenGL 3 and PyOpenGL.

I have ~50 thousand (53,490) vertices, and each of them has 199 vec3 attributes that determine its displacement. It's impossible to store this data as regular vertex attributes, so I use a texture.

The problem: a non-parallelized C function calculates the vertex displacements as fast as the GLSL version, and in some cases even faster. I've checked: the issue is the texture reads, and I don't understand how to optimize them.

I've written two different shaders. One computes the new model in ~0.09 s, the other in ~0.12 s (including attribute assignment, which is identical in both cases).

Code

Both shaders start with

#version 300 es

in vec3 vin_position;
out vec4 vin_pos;

uniform mat4 rotation_matrix;
uniform float coefficients[199];
uniform sampler2D principal_components;

The faster one is

void main(void) {
    // component-major layout: attribute i of this vertex sits at linear
    // texel index gl_VertexID + i * 53490 in the 8192-wide texture
    int c_pos = gl_VertexID;
    int texture_size = 8192;
    ivec2 texPos = ivec2(c_pos % texture_size, c_pos / texture_size);
    vec4 tmp = vec4(0.0);
    for (int i = 0; i < 199; i++) {
        tmp += texelFetch(principal_components, texPos, 0) * coefficients[i];
        c_pos += 53490;  // step to this vertex's next attribute
        texPos = ivec2(c_pos % texture_size, c_pos / texture_size);
    }
    gl_Position = rotation_matrix
        * vec4(vin_position + tmp.xyz, 246006.0);
    vin_pos = gl_Position;
}

The slower one

void main(void) {
    // vertex-major layout: all 199 attributes of a vertex share one row;
    // row width is rounded down to a multiple of 199 so no vertex wraps
    int texture_size = 8192;
    int columns = texture_size - texture_size % 199;
    int c_pos = gl_VertexID * 199;
    ivec2 texPos = ivec2(c_pos % columns, c_pos / columns);
    vec4 tmp = vec4(0.0);
    for (int i = 0; i < 199; i++) {
        tmp += texelFetch(principal_components, texPos, 0) * coefficients[i];
        texPos.x++;  // next attribute is the next texel in the row
    }
    gl_Position = rotation_matrix
        * vec4(vin_position + tmp.xyz, 246006.0);
    vin_pos = gl_Position;
}

The main difference between them:

  • in the first case the vertex attributes are stored component-major:
    • the first attribute of all vertices
    • the second attribute of all vertices
    • ...
    • the last attribute of all vertices
  • in the second case they are stored vertex-major:
    • all attributes of the first vertex
    • all attributes of the second vertex
    • ...
    • all attributes of the last vertex
  • in the second case the data is also padded so that all attributes of a vertex are stored in a single row. This means that once I know the row and column of a vertex's first attribute, I only have to increment the x component of the texture coordinate

I assumed that the aligned data would be accessed faster.

Questions

  • Why isn't the data accessed faster?
  • How can I increase the performance?
  • Is there a way to link a texture chunk to a vertex?
  • Are there recommendations for data alignment, or good articles about caching in GPUs (Intel HD, nVidia GeForce)?

Notes

  • the coefficients array changes from frame to frame; otherwise there would be no problem: I could precompute the model and be happy

Solution

  • Why isn't the data accessed faster?

    Because GPUs are not magical. GPUs gain performance by performing calculations in parallel. Performing over ten million texel fetches (53,490 vertices × 199 fetches each), no matter how it happens, is not going to be fast.

    If you were using the results of those fetches to do lighting computations, it would appear fast, because the cost of the lighting computation would be hidden by the latency of the memory fetches. Instead, you are taking the result of a fetch, doing a multiply/add, then doing another fetch. That's slow.

  • Is there a way to link a texture chunk to a vertex?

    Even if there were (and there isn't), how would that help? GPUs execute operations in parallel. That means multiple vertices are being processed simultaneously, each performing ~200 fetches from the same texture.

    So what would aid performance there is making each texture access coherent. That is, neighboring vertices would access neighboring texels, thus making the texture fetches more cache efficient. But there's no way to know what vertices will be considered "neighbors". And texture swizzle layouts are implementation dependent, so even if you did know the order of vertex processing, you couldn't adjust your texture to take local advantage of it.

    The best way to do that would be to ditch vertex shaders and texture accesses in favor of compute shaders and SSBOs. That way, you have direct knowledge of the locality of your accesses, by setting the work group size. With SSBOs, you can arrange your array in whatever fashion gives you the best locality of access for each wavefront.
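    As a rough illustration, a compute-shader version might look like the sketch below. This is an assumption-laden sketch, not the asker's code: it requires ES 3.1 (or desktop GL 4.3), it assumes the component-major layout of the faster shader, and all buffer names and bindings are made up for the example.

    #version 310 es
    precision highp float;

    // hypothetical sketch: one invocation per vertex,
    // 64 consecutive vertices per work group
    layout(local_size_x = 64) in;

    // component-major, exactly like the faster shader's texture layout
    layout(std430, binding = 0) readonly buffer PrincipalComponents {
        vec4 components[];
    };
    layout(std430, binding = 1) readonly buffer BasePositions {
        vec4 base_position[];
    };
    layout(std430, binding = 2) writeonly buffer DisplacedPositions {
        vec4 displaced_position[];
    };

    uniform float coefficients[199];
    uniform uint num_vertices;  // 53490 in the question

    void main(void) {
        uint v = gl_GlobalInvocationID.x;
        if (v >= num_vertices) { return; }
        vec4 tmp = vec4(0.0);
        for (int i = 0; i < 199; i++) {
            // neighbouring invocations read neighbouring array elements,
            // so each iteration's loads coalesce across the wavefront
            tmp += components[uint(i) * num_vertices + v] * coefficients[i];
        }
        displaced_position[v] = vec4(base_position[v].xyz + tmp.xyz, 1.0);
    }

    The host would dispatch it with glDispatchCompute((num_vertices + 63) / 64, 1, 1); within each work group the 64 invocations touch 64 consecutive array elements per iteration, which is the access locality you cannot control from a vertex shader.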

    But things like this are the equivalent of putting band-aids on a gaping wound.

  • How can I increase the performance?

    Stop doing so many texture fetches.

    I'm being completely serious. While there are ways to mitigate the costs of what you're doing, the most effective solution is to change your algorithm so that it doesn't need to do that much work.

    Your algorithm looks suspiciously like vertex morphing via a palette of "poses", with the coefficients specifying the weight applied to each pose. If that's the case, then odds are good that most of your coefficients are either 0 or negligibly small. If so, then you're wasting vast amounts of time accessing textures only to transform their contributions into nothing.

    If most of your coefficients are 0, then the best thing to do would be to pick some arbitrary and small number for the maximum number of coefficients that can affect the result. For example, 8. You send an array of 8 indices and coefficients to the shader as uniforms. Then you walk that array, fetching only 8 times. And you might be able to get away with just 4.
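
    As a sketch of that idea (the active_indices/active_coefficients uniform names are hypothetical; it reuses the component-major addressing of the faster shader above and assumes the host sorts the 199 coefficients by magnitude each frame and uploads the top 8):

    #version 300 es

    in vec3 vin_position;
    out vec4 vin_pos;

    uniform mat4 rotation_matrix;
    uniform int active_indices[8];         // hypothetical: which poses to evaluate
    uniform float active_coefficients[8];  // hypothetical: their weights
    uniform sampler2D principal_components;

    void main(void) {
        int texture_size = 8192;
        vec4 tmp = vec4(0.0);
        for (int i = 0; i < 8; i++) {
            // same component-major addressing as the faster shader
            int c_pos = gl_VertexID + active_indices[i] * 53490;
            ivec2 texPos = ivec2(c_pos % texture_size, c_pos / texture_size);
            tmp += texelFetch(principal_components, texPos, 0)
                 * active_coefficients[i];
        }
        gl_Position = rotation_matrix * vec4(vin_position + tmp.xyz, 246006.0);
        vin_pos = gl_Position;
    }

    That cuts the per-frame fetch count from ~10.6 million to ~428 thousand (53,490 × 8), at the cost of dropping the contribution of the 191 smallest coefficients.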