Search code examples
c++gpugpgpumetal

threadgroup_barrier clears memory to 0


I have a very simple kernel, in Metal Shading Language, that uses shared (threadgroup) memory but which is not working, and I can't understand why. It just writes a value to shared memory, and then read and outputs it:

kernel void test_shared(
    uint thread_index [[thread_position_in_grid]],
    uint local_index [[thread_index_in_threadgroup]],
    uint group_index [[threadgroup_position_in_grid]],
    device float* output [[buffer(0)]],
    threadgroup float shared[32])
{
    float value = 1;
    shared[local_index] = value;
    threadgroup_barrier(mem_flags::mem_threadgroup); /// <-- important line
    value = shared[local_index];
    if(local_index == 0) output[group_index] = value;
}

I launch 100 threadgroups, each with 32 threads via this C++ code:

void test()
{
    const unsigned int bufsize = sizeof(float) * 100;
    NS::SharedPtr<MTL::Buffer> buf = NS::TransferPtr(device.newBuffer(bufsize, MTL::ResourceStorageModeShared));

    MTL::CommandBuffer* cmd = queue->commandBuffer();
    MTL::ComputeCommandEncoder* encoder = cmd->computeCommandEncoder();

    encoder->setBuffer(buf.get(), 0, 0);

    encoder->setComputePipelineState(kernel_test_shared.get());
    encoder->dispatchThreadgroups(MTL::Size(100, 1, 1), MTL::Size(32, 1, 1));

    encoder->endEncoding();
    cmd->commit();

    cmd->waitUntilCompleted();

    const float* dat = (const float*)buf->contents();
    printf("buf =\n");
    for(unsigned int k = 0; k < 100; k++)
        printf("    [%u] %f\n", k, dat[k]);
}

About the output at the end of test():

  • Expected: 1
  • Actual: 0. It is 1 only if I remove the threadgroup_barrier(...), but I suspect this would be the result of a compiler optimization.

Why would the threadgroup_barrier seemingly clear the shared memory to 0?


Solution

  • If you are going to use threadgroup memory, you need to use setThreadgroupMemoryLength.

    I would also suggest explicitly specifying the [[threadgroup(n)]] binding point.