Any elegant way to deal with array margins in OpenGL Compute Shaders?


Is there any elegant way to deal with array margins in Compute Shaders? (considering you are supposed to have the dimension of the work-group hardcoded in the shader)

Consider the following shader code, which computes a prefix sum for a 2048-element array when called with glDispatchCompute(1,1,1):

#version 430 core

layout (local_size_x = 1024) in;

layout (binding = 0) coherent readonly buffer block1
{
    float input_data[gl_WorkGroupSize.x * 2]; // two elements per invocation
};

layout (binding = 1) coherent writeonly buffer block2
{
    float output_data[gl_WorkGroupSize.x * 2]; // two elements per invocation
};

shared float shared_data[gl_WorkGroupSize.x * 2];

void main(void)
{
    uint id = gl_LocalInvocationID.x;
    uint rd_id;
    uint wr_id;
    uint mask;

    // log2(1024) + 1 = 11 passes cover the 2048 elements handled by this group
    const uint steps = uint(log2(gl_WorkGroupSize.x)) + 1;
    uint step = 0;

    // Each invocation loads two adjacent elements into shared memory
    shared_data[id * 2] = input_data[id * 2];
    shared_data[id * 2 + 1] = input_data[id * 2 + 1];

    barrier();

    for (step = 0; step < steps; step++)
    {
        mask = (1 << step) - 1;
        rd_id = ((id >> step) << (step + 1)) + mask;
        wr_id = rd_id + 1 + (id & mask);

        shared_data[wr_id] += shared_data[rd_id];

        barrier();
    }

    output_data[id * 2] = shared_data[id * 2];
    output_data[id * 2 + 1] = shared_data[id * 2 + 1];
}

But what if I want to compute a prefix sum for an array of 3000 elements?


Solution

  • As for dealing with the extra, unused data, that's easy: allocate more space. Dispatch calls operate on whole work groups, so you must make sure there is adequate storage for everything you dispatch.

    Just leave the padding uninitialized in the input buffer and ignore the corresponding elements when you read the output back.
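
    As a rough host-side sketch of that padding rule: the helpers below round an element count up to a whole number of work groups. The function names are hypothetical, and the factor of two assumes each invocation handles two elements, as in the shader from the question.

```c
#include <stddef.h>

/* Elements covered by one work group: two per invocation (assumption
   taken from the shader in the question). */
size_t elems_per_group(size_t local_size)
{
    return 2 * local_size;
}

/* Number of work groups needed to cover `count` elements, rounded up. */
size_t group_count(size_t count, size_t local_size)
{
    size_t per_group = elems_per_group(local_size);
    return (count + per_group - 1) / per_group;
}

/* Number of floats each buffer must hold so that every dispatched
   invocation stays in bounds. */
size_t padded_count(size_t count, size_t local_size)
{
    return group_count(count, local_size) * elems_per_group(local_size);
}
```

    For 3000 elements and local_size_x = 1024, this gives 2 work groups and 4096 floats per buffer, so the dispatch would be glDispatchCompute(2, 1, 1) with the last 1096 elements treated as padding.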

    But there are other issues with your shader that will prevent it from working with a multi-group dispatch call:


    You have designed your shader explicitly to only work for a single work group dispatch. That is, no matter how many work groups you dispatch, they will all be reading and writing the same data.

    First, as previously discussed, stop giving an absolute length to the buffer data. You don't know how many work groups will be invoked at compile time; that's a runtime decision. So make the array's size runtime defined.

    layout (binding = 0) readonly buffer block1
    {
        float input_data[];
    };
    
    layout (binding = 1) writeonly buffer block2
    {
        float output_data[];
    };
    

    Also, note the lack of coherent. You are not using these buffers in any way that would require that qualifier.

    Your shared data does still need to have a size.

    Second, each work item is responsible for reading a specific value from input_data and writing a specific value to output_data. In your current code, this index is id, but your current code only computes it based on the work item index within a work group. To compute it for all work items in all work groups, do this:

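    const uvec3 total_size = gl_NumWorkGroups * gl_WorkGroupSize;
    const uint id = uint(dot(vec3(gl_GlobalInvocationID),
                             vec3(1, total_size.x, total_size.y * total_size.x)));
    

    The dot-product is just a fancy way of doing multiplications and then summing the components. gl_GlobalInvocationID is the 3D location globally of each work item, so each axis spans gl_NumWorkGroups * gl_WorkGroupSize invocations, and those products are the strides. Every work item will have a unique gl_GlobalInvocationID; the dot-product is just turning the 3D location into a 1D index. The explicit uint() conversion is needed because dot returns a float.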
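
    To make the flattening concrete, here is a small CPU-side sketch of the same arithmetic in plain integers. The uvec3 struct and the function name are hypothetical stand-ins for the GLSL built-ins, and it assumes the stride of each axis is the total number of invocations along the previous axes (work group count times work group size).

```c
/* Hypothetical stand-in for GLSL's uvec3. */
typedef struct { unsigned x, y, z; } uvec3;

/* Turn a 3D global invocation ID into a unique 1D index.
   width and height are the total invocation counts along x and y. */
unsigned flat_global_index(uvec3 global_id, uvec3 num_groups, uvec3 group_size)
{
    unsigned width  = num_groups.x * group_size.x;
    unsigned height = num_groups.y * group_size.y;
    return global_id.x
         + global_id.y * width
         + global_id.z * width * height;
}
```

    With a 1D dispatch such as glDispatchCompute(2, 1, 1), the y and z components are zero and the result is simply gl_GlobalInvocationID.x.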

    Third, in your actual logic, use id only for accessing data in your buffers. When accessing data in your shared storage, you need to use gl_LocalInvocationIndex (which is essentially what id used to be):

    const uint lid = gl_LocalInvocationIndex;
    shared_data[lid * 2] = input_data[id * 2];
    shared_data[lid * 2 + 1] = input_data[id * 2 + 1];
    
    for (step = 0; step < steps; step++)
    {
        mask = (1 << step) - 1;
        rd_id = ((lid >> step) << (step + 1)) + mask;
        wr_id = rd_id + 1 + (lid & mask);
    
        shared_data[wr_id] += shared_data[rd_id];
    
        barrier();
    }
    
    output_data[id * 2] = shared_data[lid * 2];
    output_data[id * 2 + 1] = shared_data[lid * 2 + 1];
    

    It's better to use gl_LocalInvocationIndex instead of gl_LocalInvocationID.x, because you may someday need more work items in a work group than you can get with just one dimension of local size. With gl_LocalInvocationIndex, the index will always take into account all dimensions of the local size.
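
    If you want to sanity-check the rd_id / wr_id arithmetic without a GPU, the loop can be emulated on the CPU. This is only a sketch under stated assumptions: the function names are hypothetical, local_size stands in for gl_WorkGroupSize.x and must be a power of two, and the invocations of a step are run one after another. That ordering is safe here because within a step every read comes from the left half of a block and every write goes to the right half.

```c
#include <stddef.h>

/* Emulate one work group's scan over 2 * local_size floats in place. */
static void scan_one_group(float *data, unsigned local_size)
{
    /* steps = log2(local_size) + 1, computed with integer shifts */
    unsigned steps = 1;
    for (unsigned s = local_size; s > 1; s >>= 1)
        steps++;

    for (unsigned step = 0; step < steps; step++) {
        unsigned mask = (1u << step) - 1u;
        /* Each lid plays the role of one invocation in the group. */
        for (unsigned lid = 0; lid < local_size; lid++) {
            unsigned rd_id = ((lid >> step) << (step + 1)) + mask;
            unsigned wr_id = rd_id + 1 + (lid & mask);
            data[wr_id] += data[rd_id];
        }
    }
}

/* Emulate a 1D dispatch: each group scans its own slice independently. */
void scan_all_groups(float *data, unsigned num_groups, unsigned local_size)
{
    for (unsigned g = 0; g < num_groups; g++)
        scan_one_group(data + (size_t)g * 2u * local_size, local_size);
}
```

    Note what this emulation makes visible: shared storage is per work group, so each group's running sum restarts at its own first element. A single prefix sum across all 3000 elements would still need the totals of earlier groups added to later ones in a further pass.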