
Metal outputting one entry in an array


I am super duper new to the Metal framework (and GPGPU in general), trying to play around with code to get a better handle on what's going on.

The following code should simply output each element of the input array plus 5.0. For some reason, however, after the last step that copies the data into the output array, it only prints

[6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Below is my little program:

import Foundation
import Metal

//Metal Setup
var device = MTLCreateSystemDefaultDevice()!
var commandQueue = device.makeCommandQueue()!
var library = device.makeDefaultLibrary()!
var commandBuffer = commandQueue.makeCommandBuffer()!
var commandEncoder = commandBuffer.makeComputeCommandEncoder()!

//Prepare Inputs/Outputs
var input = [Float](repeating: 1, count: 10)
var output = [Float](repeating: 0, count: 10)
//Setup Pipeline
let kernel = library.makeFunction(name: "sigmoid")!
var pipelineState = try device.makeComputePipelineState(function: kernel)
commandEncoder.setComputePipelineState(pipelineState)

//Create GPU Buffers for input/output
let bytelength = input.count * MemoryLayout<Float>.size

var inputBuffer = device.makeBuffer(bytes: &input, length: bytelength, options: [])
var outputBuffer = device.makeBuffer(bytes: &output, length: bytelength, options: [])

commandEncoder.setBuffer(inputBuffer, offset: 0, index: 0)
commandEncoder.setBuffer(outputBuffer, offset: 0, index: 1)

//Setup threads
var threadsPerGroup = MTLSize(width: 32, height: 1, depth: 1)
var numThreadGroups = MTLSize(width: (input.count + 31)/32, height: 1, depth: 1)
commandEncoder.dispatchThreads(numThreadGroups, threadsPerThreadgroup: threadsPerGroup)

//Start the program on the GPU, wait until finished.
commandEncoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

//Get result from GPU to CPU data.
var result = [Float](repeating: 0, count: input.count)
var data = NSData(bytesNoCopy: (outputBuffer?.contents())!, length: bytelength, freeWhenDone: false)
data.getBytes(&result, length: bytelength)
print(result)

The kernel code is pretty silly:

#include <metal_stdlib>

using namespace metal;

kernel void sigmoid(const device float *inVector [[ buffer(0) ]],
                    device float *outVector [[buffer(1) ]],
                    uint id [[ thread_position_in_grid ]]) {

    outVector[id] = inVector[id] + 5.0;
}

This seems to be consistent with the tutorials I've been reading. I'm also fairly new to Swift 4, so I'm not sure whether the code I've written is the best way to go about this. Any help at all is appreciated.


Solution

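  • As I figured out later, my numThreadGroups value meant the grid contained only one thread. dispatchThreads(_:threadsPerThreadgroup:) takes the total number of threads in the grid as its first argument, not the number of threadgroups, and (10 + 31)/32 evaluates to 1, so only outVector[0] was ever written. If I change that width to 10, i.e. the size of my array, it evaluates correctly. Silly me, but I guess I need to learn more about how threads and threadgroups interact in Metal.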
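
To make the difference concrete, here is a minimal sketch of the two ways the dispatch could be written. It is not the only way to do this. The kernel name addFive, the extra count argument at buffer index 2, and the useThreadgroupDispatch flag are all hypothetical names introduced for illustration. The kernel is compiled at runtime from a string only so the example is self-contained, and it assumes a Mac whose GPU supports non-uniform threadgroup sizes, which dispatchThreads requires.

```swift
import Foundation
import Metal

// Hypothetical kernel, compiled at runtime so the sketch is self-contained.
// The bounds check matters for the threadgroup path, which launches
// 32 threads for a 10-element array.
let source = """
#include <metal_stdlib>
using namespace metal;

kernel void addFive(const device float *inVector [[ buffer(0) ]],
                    device float *outVector [[ buffer(1) ]],
                    constant uint &count [[ buffer(2) ]],
                    uint id [[ thread_position_in_grid ]]) {
    if (id >= count) { return; }
    outVector[id] = inVector[id] + 5.0;
}
"""

// Force-unwraps keep the sketch short; real code should handle the nil cases.
func addFive(to input: [Float], useThreadgroupDispatch: Bool) throws -> [Float] {
    guard !input.isEmpty else { return [] }

    let device = MTLCreateSystemDefaultDevice()!
    let queue = device.makeCommandQueue()!
    let library = try device.makeLibrary(source: source, options: nil)
    let function = library.makeFunction(name: "addFive")!
    let pipeline = try device.makeComputePipelineState(function: function)

    let byteLength = input.count * MemoryLayout<Float>.stride
    let inBuffer = device.makeBuffer(bytes: input, length: byteLength, options: [])!
    let outBuffer = device.makeBuffer(length: byteLength, options: [])!
    var count = UInt32(input.count)

    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(inBuffer, offset: 0, index: 0)
    encoder.setBuffer(outBuffer, offset: 0, index: 1)
    encoder.setBytes(&count, length: MemoryLayout<UInt32>.size, index: 2)

    let groupWidth = min(32, pipeline.maxTotalThreadsPerThreadgroup)
    let threadsPerGroup = MTLSize(width: groupWidth, height: 1, depth: 1)

    if useThreadgroupDispatch {
        // First argument is the NUMBER OF THREADGROUPS (rounded up).
        let groupCount = (input.count + groupWidth - 1) / groupWidth
        let groups = MTLSize(width: groupCount, height: 1, depth: 1)
        encoder.dispatchThreadgroups(groups, threadsPerThreadgroup: threadsPerGroup)
    } else {
        // First argument is the TOTAL NUMBER OF THREADS in the grid.
        let threadsPerGrid = MTLSize(width: input.count, height: 1, depth: 1)
        encoder.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerGroup)
    }

    encoder.endEncoding()
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()

    // Copy the results out of the buffer into a Swift array.
    let pointer = outBuffer.contents().bindMemory(to: Float.self, capacity: input.count)
    return Array(UnsafeBufferPointer(start: pointer, count: input.count))
}

let ones = [Float](repeating: 1, count: 10)
print(try addFive(to: ones, useThreadgroupDispatch: false))
print(try addFive(to: ones, useThreadgroupDispatch: true))
```

Both paths should fill all ten elements. The threadgroup path rounds up to a whole number of groups, so it relies on the bounds check in the kernel to avoid writing past the end of the buffer.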