Optimizing Min/Max Depth GLSL Shader

I am implementing tiled deferred shading, and for that I need to compute the minimum/maximum depth values of a tile. I'm rendering 1 pixel per tile, and collect the depth values in a nested for loop, like this:

float minDepth = 1.0;
float maxDepth = 0.0;

ivec2 clampMax = ivec2(screenSize) - 1;

// Iterate over each pixel in this tile
for (int x = 0; x < 32; x++) {
    for (int y = 0; y < 32; y++) {
        ivec2 newCoord = screenCoord + ivec2(x,y);
        newCoord = min(newCoord, clampMax);

        // Fetch the depth for that coordinate
        float currentDepth = texelFetch(depth, newCoord, 0).r;

        minDepth = min(minDepth, currentDepth);
        maxDepth = max(maxDepth, currentDepth);
    }
}

This works fine so far, but looking at the generated assembly, the texture lookup gets something like this:

// R2.xy contains 'newCoord'
MOV.S R2.z, {0, 0, 0, 0}.x;
TXF.F R1.x, R2.xyzz, handle(D0.x), 2D;

Which basically is equal to:

vec3 coordinate;
coordinate.xy = newCoord;
coordinate.z = 0;
result = texelFetch(depth, coordinate);

So it generates one extra unecessary instruction for the texture lookup, which sums up a lot in such a loop. My guess is, that NVIDIA internally implemented texelFetch as

texelFetch(sampler2D sampler, ivec3 coord)

Back to the question: How would you optimize this loop?

I'm using a GTX 670 with latest drivers on windows.

Solution

Don't worry about these additional steps. It will most likely be done in registers which are 200+ times faster than a single global memory access (texelFetch).

But here is a way to optimize the problem instead of the loop:

Generally speaking the most efficient GPU programs are those where every Thread does as little as possible work and all thread work combined is the same amount as you would need with a sequential algorithm.

Opengls approach is now to calculate every pixel in its own thread on the GPU. This is totally fine for the most cases but in your problem the amount of work per thread is pretty large (32*32*texelFetch).

So how to optimize this problem?

-> Reduce the amout of work per thread

How?

-> Parallel Reduction (http://www.drdobbs.com/architecture-and-design/parallel-pattern-7-reduce/222000718)

Informal description:

You have your 32x32 area.
Instead of calculating the min/max of the complete area you do it in multiple steps.

-> Calculate the min/max of 2x2 blocks (16x16 blocks per area)

-> So now your image is 4 times smaller

-> Do this 5 times

-> You have now the min/max of the complete area