Slow texture fetch in fragment shader using Vulkan

I am writing an SSAO shader with a kernel size of 64.

SSAO fragment shader:

const int kernelSize = 64;
for (int i = 0; i < kernelSize; i++) {
    // Get sample position
    vec3 s = tbn * ubo.kernel[i].xyz;
    s = s * radius + origin;
    vec4 offset = vec4(s, 1.0);
    offset = ubo.projection * offset;
    offset.xy /= offset.w;
    offset.xy = offset.xy * 0.5 + 0.5;
    float sampleDepth = texture(samplerposition, offset.xy).z;
    float rangeCheck = abs(origin.z - sampleDepth) < radius ? 1.0 : 0.0;
    occlusion += (sampleDepth >= s.z ? 1.0 : 0.0) * rangeCheck;
}

The samplerposition texture has the format VK_FORMAT_R16G16B16A16_SFLOAT and its memory is allocated with the VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT flag.

I'm using a laptop with an NVIDIA K1100M graphics card. When I run the code in RenderDoc, this shader takes 114 ms. If I change kernelSize to 1, it takes 1 ms.

Is this texture fetch time normal? Or could I have set something up incorrectly somewhere?

For example, the layout transition might not have gone through, leaving the texture in VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL instead of VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL.

Solution

GPU memory performance relies heavily on caching, and the cache is much less effective when fragments close to each other do not sample texels that are next to each other. This is known as a lack of spatial coherence. I would expect a slowdown of about 10x or more for random access to a texture compared with linear, coherent access. SSAO is very prone to this when used with large radii.
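To make the radius argument concrete, here is a rough CPU-side toy model, not a description of how any particular GPU cache works. It counts how many distinct texture tiles the 64 taps of a single fragment land in for a given sampling radius. The tile size, the spiral tap pattern, and the function name tilesTouched are all assumptions made up for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <set>
#include <utility>

// Hypothetical values for illustration only; real GPU cache geometry differs.
constexpr int kTileSize = 4;  // assumed cache-line footprint: 4x4 texels
constexpr int kTaps = 64;     // kernel size from the question

// Count the distinct tiles that one fragment's taps fall into when the taps
// are spread over a disc of the given radius (measured in texels).
int tilesTouched(double radiusTexels) {
    assert(radiusTexels >= 0.0);
    const double pi = std::acos(-1.0);
    const double goldenAngle = pi * (3.0 - std::sqrt(5.0));
    std::set<std::pair<int, int>> tiles;
    for (int i = 0; i < kTaps; ++i) {
        // Golden-angle spiral: a simple, deterministic stand-in for the kernel.
        const double r = radiusTexels * std::sqrt((i + 0.5) / kTaps);
        const double a = i * goldenAngle;
        // Floor so that negative coordinates map to the correct tile.
        const int tx = static_cast<int>(std::floor(r * std::cos(a) / kTileSize));
        const int ty = static_cast<int>(std::floor(r * std::sin(a) / kTileSize));
        tiles.insert({tx, ty});
    }
    return static_cast<int>(tiles.size());
}
```

In this model a radius of 2 texels keeps all taps within at most 4 tiles, while a radius of 64 texels scatters them over many more. Each extra tile stands for another potential cache miss, which is the effect described above.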

I recommend using smaller radii and optimizing the texture accesses. You're sampling four 16-bit floats per texel, but you're only using one of them (.z). Writing the depth out to a separate 16-bit, depth-only image cuts the data fetched per sample to a quarter, which should give you an easy speedup of up to 4x.
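As a back-of-the-envelope check of the 4x figure, the following sketch compares the texel bytes requested per frame for the two formats. It is an upper bound that ignores cache hits entirely, and the helper name bytesPerFrame and the 1280x720 resolution used below are assumptions, since the question does not state a resolution.

```cpp
#include <cstdint>

// Bytes per texel for the two formats discussed above.
constexpr std::uint64_t kBytesRGBA16F = 4 * 2;  // VK_FORMAT_R16G16B16A16_SFLOAT
constexpr std::uint64_t kBytesR16F = 1 * 2;     // VK_FORMAT_R16_SFLOAT

// Upper bound on the texel bytes requested in one full-screen pass,
// assuming every tap misses the cache (hypothetical worst case).
constexpr std::uint64_t bytesPerFrame(std::uint64_t width, std::uint64_t height,
                                      std::uint64_t taps,
                                      std::uint64_t bytesPerTexel) {
    return width * height * taps * bytesPerTexel;
}
```

For an assumed 1280x720 target with 64 taps, this gives about 472 MB per frame for the four-channel format against about 118 MB for a single 16-bit channel. The real gain depends on how well the cache behaves, so treat the factor of 4 as a ceiling on the bandwidth saving rather than a guaranteed frame-time improvement.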