multithreading macos synchronization gpgpu metal

Synchronizing all threads in a grid in Metal

I am trying to write a norm or a squared length function for an n-sized vector in Metal. To do this, I planned on having every thread square each element, then elect one thread to sum all elements.

Here is my current kernel:

#include <metal_stdlib>
#include <metal_compute>
using namespace metal;

kernel void length_squared(const device float *x [[ buffer(0) ]],
                           device float *s [[ buffer(1) ]],
                           device float *out [[ buffer(2) ]],
                           uint gid [[ thread_position_in_grid ]],
                           uint numElements [[ threads_per_grid ]])
{
    s[gid] = x[gid];// * x[gid];
    simdgroup_barrier(mem_flags::mem_none);
    if(gid == 0){
        for(uint i = 0; i < numElements; i++){
            *out += s[i];
        }
    }
}

Unfortunately, this code does not compile, for "Use of Undeclared Identifier simdgroup_barrier". The method is documented in the Metal Shading Language Specification.

Has anyone encountered this? or know how to synchronize all threads across a grid? threadgroup_barrier does not achieve total synchronization for me.

Am I approaching this problem incorrectly? What is the best way to synchronize this operation?

Solution

A SIMD group is smaller than a threadgroup, so that synchronization won't work.

Instead, you'll want to use a parallel reduction to sum up the values in parallel. Here is some Metal code I found.

Though, if you don't mind a single thread doing all the summing, you can run a separate kernel with just one thread to do the sum. Of course, this can be very slow.