I am trying to write a norm or a squared length function for an n-sized vector in Metal. To do this, I planned on having every thread square each element, then elect one thread to sum all elements.
Here is my current kernel:
#include <metal_stdlib>
#include <metal_compute>
using namespace metal;
kernel void length_squared(const device float *x [[ buffer(0) ]],
device float *s [[ buffer(1) ]],
device float *out [[ buffer(2) ]],
uint gid [[ thread_position_in_grid ]],
uint numElements [[ threads_per_grid ]])
{
s[gid] = x[gid];// * x[gid];
simdgroup_barrier(mem_flags::mem_none);
if(gid == 0){
for(uint i = 0; i < numElements; i++){
*out += s[i];
}
}
}
Unfortunately, this code does not compile, for "Use of Undeclared Identifier simdgroup_barrier
". The method is documented in the Metal Shading Language Specification.
Has anyone encountered this? or know how to synchronize all threads across a grid? threadgroup_barrier
does not achieve total synchronization for me.
Am I approaching this problem incorrectly? What is the best way to synchronize this operation?
A SIMD group is smaller than a threadgroup, so that synchronization won't work.
Instead, you'll want to use a parallel reduction to sum up the values in parallel. Here is some Metal code I found.
Though, if you don't mind a single thread doing all the summing, you can run a separate kernel with just one thread to do the sum. Of course, this can be very slow.