image-processing gpgpu metal compute-shader

Shared memory with Metal compute shaders (how to access data outside of shared thread group memory?)


I've written a Metal compute shader that:

  • accepts a 2-D Metal texture
  • writes to two Metal textures
    • one with vertical motion blur
    • one with horizontal motion blur

Implementation:

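#include <metal_stdlib>
using namespace metal;

// Pixel (non-normalized) coordinates, clamped at the texture edges.
constexpr sampler s(coord::pixel, address::clamp_to_edge);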

kernel void motion_blur(texture2d<half, access::sample> gray_source [[ texture(0) ]],
                        texture2d<half, access::write> b_hor [[ texture(1) ]],
                        texture2d<half, access::write> b_ver [[ texture(2) ]],
                        uint2 globalId [[ thread_position_in_grid ]]) {
    float2 c = static_cast<float2>(globalId);
    
    // calculate the value of a motion-blurred image.
    half filter_len = 15;
    int lower_bound = int(floor(filter_len / 2.0)) * -1.0;
    int upper_bound = int(floor(filter_len / 2.0) + 1.0);
    half g_x = 0;
    half g_y = 0;
    for (int i = lower_bound; i < upper_bound; i++) {
        half xGray = gray_source.sample(s, c, int2(i, 0)).x / filter_len;
        g_x += xGray;
        half yGray = gray_source.sample(s, c, int2(0, i)).x / filter_len;
        g_y += yGray;
    }
    b_hor.write(g_x, globalId);
    b_ver.write(g_y, globalId);
}

Above, the filter length is set to 15, but I need a filter length of ~30.

Experimentally, filter lengths greater than 15 do not increase the amount of motion blur as I would expect.

I'm not sure, but I have a hunch that this is related to my threadgroup size:

threadgroupSize = MTLSize(width: 16, height: 16, depth: 1)

but increasing the width and height to 32 does not have the desired effect either. I'm guessing that's due to hardware limitations.

I'm new to shader programming, and have even less experience with Metal. What can I do to give each thread access to larger portions of the texture?


Solution

  • Thanks to a pointer from a very helpful and friendly engineer, I now know what my issue was. In the following line, I was abusing the offset of the sampler:

    half xGray = gray_source.sample(s, c, int2(i, 0)).x / filter_len;
    

    As far as I can tell, only offsets from -8 to 7 work. The Metal Shading Language Specification appears to confirm this: it requires the offset to be in the range -8 to +7 and leaves the result undefined otherwise. That said, updating the line as below works just fine:

    half xGray = gray_source.sample(s, float2(c.x + i, c.y)).x / filter_len;
    

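    I was just confused about the correct way to specify the coordinate I wanted to sample. The vertical sample on the following line of the kernel needs the same change, with i added to c.y instead of c.x.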
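
To see why a filter length of 15 happened to work while larger lengths did not, here is a small host-side C++ sketch (not Metal code) that mirrors the loop bounds from the question and counts how many loop indices fall outside the -8 to 7 range mentioned above. The helper names (lower_bound, upper_bound, tap_count, taps_out_of_range) and the two range constants are made up for illustration; the range itself is taken from this answer, not measured on hardware.

```cpp
// Offset range this answer reports for the sampler's offset argument
// (an assumption taken from the answer text, not queried from Metal).
constexpr int kMinOffset = -8;
constexpr int kMaxOffset = 7;

// Loop bounds as computed in the question's kernel. For positive lengths,
// integer division gives the same result as floor(filter_len / 2.0).
constexpr int lower_bound(int filter_len) { return -(filter_len / 2); }
constexpr int upper_bound(int filter_len) { return filter_len / 2 + 1; }

// Number of iterations the question's loop performs.
constexpr int tap_count(int filter_len) {
    return upper_bound(filter_len) - lower_bound(filter_len);
}

// Number of iterations whose index lies outside [kMinOffset, kMaxOffset].
constexpr int taps_out_of_range(int filter_len) {
    int count = 0;
    for (int i = lower_bound(filter_len); i < upper_bound(filter_len); ++i) {
        if (i < kMinOffset || i > kMaxOffset) {
            ++count;
        }
    }
    return count;
}

// Length 15 visits offsets -7..7, all inside the allowed range.
static_assert(taps_out_of_range(15) == 0, "15 taps fit in [-8, 7]");
// Length 17 visits -8..8, so the offset 8 is already out of range.
static_assert(taps_out_of_range(17) == 1, "offset 8 is out of range");
// Length 30 visits -15..15, and 15 of those offsets are out of range.
static_assert(taps_out_of_range(30) == 15, "most taps are out of range");
// Even lengths produce one more tap than filter_len.
static_assert(tap_count(30) == 31, "even lengths visit filter_len + 1 taps");
```

The checks run at compile time under C++14 or later. One side effect worth noting: with the question's bounds, an even filter length such as 30 loops 31 times while each sample is still divided by 30, so the result comes out slightly brighter than a true average. Dividing by the actual tap count, or using an odd length, avoids that.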