vkCreateComputePipelines takes too long


I encountered a strange problem when compiling a Vulkan compute shader. I have this shader (which is not even all that complex):

#version 450
#extension GL_GOOGLE_include_directive : enable
//#extension GL_EXT_debug_printf : enable
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_arithmetic : enable

#define IS_AVAILABLE_BUFFER_ANN_ENTITIES
#define IS_AVAILABLE_BUFFER_GLOBAL_MUTABLES
#define IS_AVAILABLE_BUFFER_BONES
#define IS_AVAILABLE_BUFFER_WORLD
//#define IS_AVAILABLE_BUFFER_COLLISION_GRID

#include "descriptors_compute.comp"

layout (local_size_x_id = GROUP_SIZE_CONST_ID) in;

#include "utils.comp"

shared float[ANN_MAX_SIZE] tmp1;
shared float[ANN_MAX_SIZE] tmp2;
shared uint[ANN_TOUCHED_BLOCK_COUNT] touched_block_ids;
mat3 rotation_mat_from_yaw_and_pitch(vec2 yaw_and_pitch){
    const vec2 Ss = sin(yaw_and_pitch); // let S denote sin(yaw) and s denote sin(pitch)
    const vec2 Cc = cos(yaw_and_pitch); // let C denote cos(yaw) and c denote cos(pitch)
    const vec4 Cs_cC_Sc_sS = vec4(Cc,Ss) * vec4(Ss.y,Cc,Ss.x);
    return mat3(Cs_cC_Sc_sS.y,-Ss.y,-Cs_cC_Sc_sS.z,Cs_cC_Sc_sS.x,Cc.y,-Cs_cC_Sc_sS.w,Ss.x,0,Cc.x);
}
void main() {
    const uint entity_id = gl_WorkGroupID.x;
    const uint lID = gl_LocalInvocationID.x;
    const uint entities_count = global_mutables.ann_entities;
    if (entity_id < entities_count){
        const AnnEntity entity = ann_entities[entity_id];
        const Bone bone = bones[entity.bone_idx];
        const mat3 rotation = rotation_mat_from_yaw_and_pitch(bone.yaw_and_pitch);
        const uint BLOCK_TOUCH_SENSE_OFFSET = 0;
        const uint LIDAR_LENGTH_SENSE_OFFSET = BLOCK_EXTENDED_SENSORY_FEATURES_LEN*ANN_TOUCHED_BLOCK_COUNT;
        for(uint i=lID;i<ANN_LIDAR_COUNT;i+=GROUP_SIZE){
            const vec3 rotated_lidar_direction = rotation * entity.lidars[i].direction;
            const RayCastResult ray = ray_cast(bone.new_center, rotated_lidar_direction);
            tmp1[LIDAR_LENGTH_SENSE_OFFSET+i] = ray.ratio_of_traversed_length;
        }
        for(uint i = lID;i<ANN_OUTPUT_SIZE;i+=GROUP_SIZE){
            const AnnSparseOutputNeuron neuron = entity.ann_output[i];
            float sum = neuron.bias;
            for(uint j=0;j<neuron.incoming.length();j++){
                sum += tmp1[neuron.incoming[j].src_neuron] * neuron.incoming[j].weight;
            }
            tmp2[i] = max(0,sum);//ReLU activation
        }
        vec2 rotation_change = vec2(0,0);
        for(uint i = lID;i<ANN_OUTPUT_ROTATION_MUSCLES_SIZE;i+=GROUP_SIZE){
            rotation_change += tmp2[ANN_OUTPUT_ROTATION_MUSCLES_OFFSET+i] * ANN_IMPULSES_OF_ROTATION_MUSCLES[i];
        }
        rotation_change = subgroupAdd(rotation_change);
        if(lID==0){
            bones[entity.bone_idx].yaw_and_pitch += rotation_change;
        }
    }
}

The function ray_cast is probably the most complex part of this shader, but I reuse this exact function in many other shaders that compile instantly. I wondered whether GL_KHR_shader_subgroup_arithmetic might be slowing down vkCreateComputePipelines, but removing it makes no difference. The call to vkCreateComputePipelines takes over a minute to finish. I also include a bunch of utility functions, but I only use ray_cast and a few constants from them, so 90% of that code is unused and should be removed by glslc. Could it be that the Vulkan driver is quietly performing some other kind of optimisation, and that this is causing the delay? I thought that all optimisations were done by glslc and that not much postprocessing was done on SPIR-V. I use an Nvidia GPU with the proprietary drivers, by the way.

It really puzzles me why this shader is so slow to create, even though I have other shaders that are ten times longer and more complex and yet they load instantly.

Is there any way to profile this?


Solution

  • Upon closer inspection, I noticed that the generated SPIR-V files for my other shaders are normally about 10-30 KB each. This one shader, however, is 178 KB.

    With the help of spirv-dis, I looked inside the generated assembly and noticed that the vast majority of the opcodes were OpConstant. This was because I had structs that looked like

    struct AnnSparseOutputNeuron{
        AnnSparseConnection[ANN_LATENT_CONNECTIONS_PER_OUTPUT_NEURON] incoming;
        float bias;
    };
    

    These structs contain large arrays. As a result, both

    const AnnEntity entity = ann_entities[entity_id];
    

    and

    const AnnSparseOutputNeuron neuron = entity.ann_output[i];
    

    would be compiled into a large number of opcodes that write those constant values for every single element of the array. So instead of writing code of the form

    const A a = buffer_of_As[i];
    f(a.some_field);
    

    it's better to use

    f(buffer_of_As[i].some_field);
    

    This seems to have solved the problem. I thought that glslc would be smart enough to figure out such optimizations but apparently it's not.
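
To see which instructions dominate a bloated SPIR-V module, the spirv-dis output can be tallied by opcode. Below is a minimal Python sketch of that idea; it is not part of the original workflow. The names opcode_histogram and SAMPLE are hypothetical and made up for illustration. It assumes the textual disassembly has been saved to a file, for example by redirecting the standard output of spirv-dis.

```python
import re
import sys
from collections import Counter

# Matches an optional result id ("%name = ") followed by the opcode name.
_INSTRUCTION = re.compile(r"^\s*(?:%\S+\s*=\s*)?(Op[A-Za-z0-9]+)")

# Tiny hand-written sample in the style of spirv-dis output (illustration only).
SAMPLE = """\
; SPIR-V
; Version: 1.5
               OpCapability Shader
          %1 = OpExtInstImport "GLSL.std.450"
      %float = OpTypeFloat 32
    %float_0 = OpConstant %float 0
    %float_1 = OpConstant %float 1
       %uint = OpTypeInt 32 0
     %uint_3 = OpConstant %uint 3
"""


def opcode_histogram(disassembly: str) -> Counter:
    """Count how often each opcode appears in textual SPIR-V disassembly."""
    counts = Counter()
    for line in disassembly.splitlines():
        if line.lstrip().startswith(";"):  # header and comment lines
            continue
        match = _INSTRUCTION.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def main(argv):
    # Use the file given on the command line, or fall back to the sample.
    if len(argv) > 1:
        with open(argv[1], encoding="utf-8") as handle:
            text = handle.read()
    else:
        text = SAMPLE
    for opcode, count in opcode_histogram(text).most_common(10):
        print(f"{count:8d}  {opcode}")


if __name__ == "__main__":
    main(sys.argv)
```

Running the script with the path of a saved disassembly prints the ten most frequent opcodes. A module like the one described above, where OpConstant dominates, would show that opcode at the top of the list, which makes this kind of bloat easy to spot without reading the whole assembly.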