What can cause lag in recurrent calls to the draw() function of a MetalKit MTKView?

I am designing a Cocoa application using the Swift 4.0 MetalKit API for macOS 10.13. Everything I report here was done on my 2015 MacBook Pro.

I have successfully implemented an MTKView which renders simple geometry with a low vertex count very well (cubes, triangles, etc.). I implemented a mouse-drag based camera which rotates, strafes and magnifies. Here is a screenshot of the Xcode FPS debug screen while I rotate the cube:

However, when I try loading a dataset which contains only ~1500 vertices (each stored as seven 32-bit Floats, i.e. 42 kB in total), the FPS drops very badly. The code is shown below. Here is a screenshot (note that in this image, the view only encompasses a few of the vertices, which are rendered as large points):

Here is my implementation:

1) viewDidLoad() :

override func viewDidLoad() {

    super.viewDidLoad()

    // Initialization of the projection matrix and camera
    self.projectionMatrix = float4x4.makePerspectiveViewAngle(float4x4.degrees(toRad: 85.0),
                                      aspectRatio: Float(self.view.bounds.size.width / self.view.bounds.size.height),
                                      nearZ: 0.01, farZ: 100.0)
    self.vCam = ViewCamera()

    // Initialization of the MTLDevice
    metalView.device = MTLCreateSystemDefaultDevice()
    device = metalView.device
    metalView.colorPixelFormat = .bgra8Unorm

    // Initialization of the shader library
    let defaultLibrary = device.makeDefaultLibrary()!
    let fragmentProgram = defaultLibrary.makeFunction(name: "basic_fragment")
    let vertexProgram = defaultLibrary.makeFunction(name: "basic_vertex")

    // Initialization of the MTLRenderPipelineState
    let pipelineStateDescriptor = MTLRenderPipelineDescriptor()
    pipelineStateDescriptor.vertexFunction = vertexProgram
    pipelineStateDescriptor.fragmentFunction = fragmentProgram
    pipelineStateDescriptor.colorAttachments[0].pixelFormat = .bgra8Unorm
    pipelineState = try! device.makeRenderPipelineState(descriptor: pipelineStateDescriptor)

    // Initialization of the MTLCommandQueue
    commandQueue = device.makeCommandQueue()

    // Initialization of Delegates and BufferProvider for View and Projection matrix MTLBuffer
    self.metalView.delegate = self
    self.metalView.eventDelegate = self
    self.bufferProvider = BufferProvider(device: device, inflightBuffersCount: 3, sizeOfUniformsBuffer: MemoryLayout<Float>.size * float4x4.numberOfElements() * 2)
}

2) Loading of the MTLBuffer for the Cube vertices :

private func makeCubeVertexBuffer() {

    let cube = Cube()
    let vertices = cube.verticesArray
    var vertexData = Array<Float>()
    for vertex in vertices{
        vertexData += vertex.floatBuffer()
    }
    VDataSize = vertexData.count * MemoryLayout.size(ofValue: vertexData[0])
    self.vertexBuffer = device.makeBuffer(bytes: vertexData, length: VDataSize!, options: [])!
    self.vertexCount = vertices.count
}

3) Loading of the MTLBuffer for the dataset vertices. Note that I explicitly declare the storage mode of this buffer as Private in order to ensure efficient access to the data by the GPU since the CPU does not need to access the data once the buffer is loaded. Also, note that I am loading only 1/100th of the vertices in my actual dataset because the entire OS on my machine starts lagging when I try to load it entirely (only 4.2 MB of data).

public func loadDataset(datasetVolume: DatasetVolume) {

    // Load dataset vertices
    self.datasetVolume = datasetVolume
    self.datasetVertexCount = self.datasetVolume!.vertexCount/100
    let rgbaVertices = self.datasetVolume!.rgbaPixelVolume[0...(self.datasetVertexCount!-1)]
    var vertexData = Array<Float>()
    for vertex in rgbaVertices{
            vertexData += vertex.floatBuffer()
    }
    let dataSize = vertexData.count * MemoryLayout.size(ofValue: vertexData[0])

    // Make two MTLBuffer's: One with Shared storage mode in which data is initially loaded, and a second one with Private storage mode
    self.datasetVertexBuffer = device.makeBuffer(bytes: vertexData, length: dataSize, options: MTLResourceOptions.storageModeShared)
    self.datasetVertexBufferGPU = device.makeBuffer(length: dataSize, options: MTLResourceOptions.storageModePrivate)

    // Create a MTLCommandBuffer and blit the vertex data from the Shared MTLBuffer to the Private MTLBuffer
    let commandBuffer = self.commandQueue.makeCommandBuffer()
    let blitEncoder = commandBuffer!.makeBlitCommandEncoder()
    blitEncoder!.copy(from: self.datasetVertexBuffer!, sourceOffset: 0, to: self.datasetVertexBufferGPU!, destinationOffset: 0, size: dataSize)
    blitEncoder!.endEncoding()
    commandBuffer!.commit()

    // Clean up
    self.datasetLoaded = true
    self.datasetVertexBuffer = nil
}

4) Finally, here is the render loop. Again, this is using MetalKit.

func draw(in view: MTKView) {
    render(view.currentDrawable)
}

private func render(_ drawable: CAMetalDrawable?) {
    guard let drawable = drawable else { return }

    // Make sure an MTLBuffer for the View and Projection matrices is available
    _ = self.bufferProvider?.availableResourcesSemaphore.wait(timeout: DispatchTime.distantFuture)

    // Initialize common RenderPassDescriptor
    let renderPassDescriptor = MTLRenderPassDescriptor()
    renderPassDescriptor.colorAttachments[0].texture = drawable.texture
    renderPassDescriptor.colorAttachments[0].loadAction = .clear
    renderPassDescriptor.colorAttachments[0].clearColor = Colors.White
    renderPassDescriptor.colorAttachments[0].storeAction = .store

    // Initialize a CommandBuffer and add a CompletedHandler to release an MTLBuffer from the BufferProvider once the GPU is done processing this command
    let commandBuffer = self.commandQueue.makeCommandBuffer()
    commandBuffer?.addCompletedHandler { (_) in
        self.bufferProvider?.availableResourcesSemaphore.signal()
    }

    // Update the View matrix and obtain an MTLBuffer for it and the projection matrix
    let camViewMatrix = self.vCam.getLookAtMatrix()
    let uniformBuffer = bufferProvider?.nextUniformsBuffer(projectionMatrix: projectionMatrix, camViewMatrix: camViewMatrix)

    // Initialize a MTLParallelRenderCommandEncoder
    let parallelEncoder = commandBuffer?.makeParallelRenderCommandEncoder(descriptor: renderPassDescriptor)

    // Create a CommandEncoder for the cube vertices if its data is loaded
    if self.cubeLoaded == true {
        let cubeRenderEncoder = parallelEncoder?.makeRenderCommandEncoder()
        cubeRenderEncoder!.setCullMode(MTLCullMode.front)
        cubeRenderEncoder!.setRenderPipelineState(pipelineState)
        cubeRenderEncoder!.setTriangleFillMode(MTLTriangleFillMode.fill)
        cubeRenderEncoder!.setVertexBuffer(self.cubeVertexBuffer, offset: 0, index: 0)
        cubeRenderEncoder!.setVertexBuffer(uniformBuffer, offset: 0, index: 1)
        cubeRenderEncoder!.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: vertexCount!, instanceCount: self.cubeVertexCount!/3)
        cubeRenderEncoder!.endEncoding()
    }

    // Create a CommandEncoder for the dataset vertices if its data is loaded
    if self.datasetLoaded == true {
        let rgbaVolumeRenderEncoder = parallelEncoder?.makeRenderCommandEncoder()
        rgbaVolumeRenderEncoder!.setRenderPipelineState(pipelineState)
        rgbaVolumeRenderEncoder!.setVertexBuffer( self.datasetVertexBufferGPU!, offset: 0, index: 0)
        rgbaVolumeRenderEncoder!.setVertexBuffer(uniformBuffer, offset: 0, index: 1)
        rgbaVolumeRenderEncoder!.drawPrimitives(type: .point, vertexStart: 0, vertexCount: datasetVertexCount!, instanceCount: datasetVertexCount!)
        rgbaVolumeRenderEncoder!.endEncoding()
    }

    // End CommandBuffer encoding and commit task
    parallelEncoder!.endEncoding()
    commandBuffer!.present(drawable)
    commandBuffer!.commit()
}

Alright, so these are the steps I have been through in trying to figure out what was causing the lag, keeping in mind that the lagging effect is proportional to the size of the dataset's vertex buffer:

I initially thought it was due to the GPU not being able to access the memory quickly enough because it was in Shared storage mode -> I changed the dataset MTLBuffer to Private storage mode. This did not solve the problem.
I then thought that the problem was due to the CPU spending too much time in my render() function. This could be due to a problem with the BufferProvider, or maybe because the CPU was somehow reprocessing/reloading the dataset vertex buffer every frame -> In order to check this, I used the Time Profiler in Xcode's Instruments. Unfortunately, it seems that the problem is that the application calls this render method (in other words, MTKView's draw() method) only very rarely. Here are some screenshots:

The spike at ~10 seconds is when the cube is loaded
The spikes between ~25-35 seconds are when the dataset is loaded

This image (^) shows the activity between ~10-20 seconds, right after the cube was loaded. This is when the FPS is at ~60. You can see that the main thread spends around 53ms in the render() function during these 10 seconds.

This image (^) shows the activity between ~40-50 seconds, right after the dataset was loaded. This is when the FPS is < 10. You can see that the main thread spends around 4ms in the render() function during these 10 seconds. As you can see, none of the methods which are usually called from within this function are called (i.e. the ones we can see called when only the cube is loaded, previous image). Of note, when I load the dataset, the Time Profiler's timer starts to jump (i.e. it stops for a few seconds and then jumps to the current time, over and over).

So this is where I am. The problem seems to be that the CPU somehow gets overloaded with these 42 kB of data, over and over. I also did a test with the Allocations instrument in Xcode's Instruments. No signs of a memory leak, as far as I could tell (you might have noticed that a lot of this is new to me).

Sorry for the convoluted post, I hope it's not too hard to follow. Thank you all in advance for your help.

Edit:

Here are my shaders, in case you would like to see them:

struct VertexIn{
    packed_float3 position;
    packed_float4 color;
};

struct VertexOut{
    float4 position [[position]];  
    float4 color;
    float  size [[point_size]];
};

struct Uniforms{
    float4x4 cameraMatrix;
    float4x4 projectionMatrix;
};


vertex VertexOut basic_vertex(const device VertexIn* vertex_array [[ buffer(0) ]],
                              constant Uniforms&  uniforms    [[ buffer(1) ]],
                              unsigned int vid [[ vertex_id ]]) {

    float4x4 cam_Matrix = uniforms.cameraMatrix;
    float4x4 proj_Matrix = uniforms.projectionMatrix;

    VertexIn VertexIn = vertex_array[vid];

    VertexOut VertexOut;
    VertexOut.position = proj_Matrix * cam_Matrix * float4(VertexIn.position,1);
    VertexOut.color = VertexIn.color;
    VertexOut.size = 15;

    return VertexOut;
}

fragment half4 basic_fragment(VertexOut interpolated [[stage_in]]) {
    return half4(interpolated.color[0], interpolated.color[1], interpolated.color[2], interpolated.color[3]);
}

Solution

I think the main problem is that you're telling Metal to do instanced drawing when you shouldn't be. This line:

rgbaVolumeRenderEncoder!.drawPrimitives(type: .point, vertexStart: 0, vertexCount: datasetVertexCount!, instanceCount: datasetVertexCount!)

is telling Metal to draw datasetVertexCount! instances of each of datasetVertexCount! vertices. The GPU work is growing with the square of the vertex count. Also, since you don't make use of the instance ID to, for example, tweak the vertex position, all of these instances are identical and thus redundant.
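To put rough numbers on that, here is a back-of-the-envelope sketch. The helper function is made up for illustration (it is not a Metal API), and the full dataset size is assumed to be 100 times the ~1500 vertices mentioned in the question.

```swift
// Hypothetical helper (not part of Metal): a drawPrimitives call requests
// roughly vertexCount * instanceCount vertex-shader invocations.
func vertexInvocations(vertexCount: Int, instanceCount: Int = 1) -> Int {
    return vertexCount * instanceCount
}

let reduced = 1_500        // ~1/100th of the dataset, as in the question
let full = reduced * 100   // assumed size of the whole dataset

// Instanced as in the question: work grows with the square of the count.
print(vertexInvocations(vertexCount: reduced, instanceCount: reduced)) // 2250000
print(vertexInvocations(vertexCount: full, instanceCount: full))       // 22500000000

// Non-instanced: work grows linearly.
print(vertexInvocations(vertexCount: reduced))                         // 1500
print(vertexInvocations(vertexCount: full))                            // 150000
```

So even the reduced dataset asks for over two million points per frame, each 15 pixels wide, instead of 1500.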

I think the same applies to this line:

cubeRenderEncoder!.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: vertexCount!, instanceCount: self.cubeVertexCount!/3)

although it's not clear what self.cubeVertexCount! is and whether it grows with vertexCount. In any case, since it seems you're using the same pipeline state and thus same shaders which don't make use of the instance ID, it's still useless and wasteful.

Other things:

Why are you using MTLParallelRenderCommandEncoder when you're not actually using the parallelism that it enables? Don't do that.
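As a rough sketch of what I mean (untested against your project; the function name and parameters are placeholders standing in for your properties), a single plain MTLRenderCommandEncoder with two non-instanced draws would look something like this:

```swift
#if canImport(Metal)
import Metal

// Sketch only: one ordinary render encoder, two non-instanced draws.
// Every parameter is a placeholder for a property from the question.
func encodeScene(commandBuffer: MTLCommandBuffer,
                 descriptor: MTLRenderPassDescriptor,
                 pipelineState: MTLRenderPipelineState,
                 uniformBuffer: MTLBuffer,
                 cubeVertexBuffer: MTLBuffer?, cubeVertexCount: Int,
                 datasetVertexBuffer: MTLBuffer?, datasetVertexCount: Int) {
    guard let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: descriptor) else {
        return
    }
    // State shared by both draws is set once.
    encoder.setRenderPipelineState(pipelineState)
    encoder.setVertexBuffer(uniformBuffer, offset: 0, index: 1)

    if let cube = cubeVertexBuffer {
        encoder.setCullMode(.front)
        encoder.setVertexBuffer(cube, offset: 0, index: 0)
        // No instanceCount: each vertex is processed once.
        encoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: cubeVertexCount)
    }
    if let dataset = datasetVertexBuffer {
        encoder.setCullMode(.none)
        encoder.setVertexBuffer(dataset, offset: 0, index: 0)
        encoder.drawPrimitives(type: .point, vertexStart: 0, vertexCount: datasetVertexCount)
    }
    encoder.endEncoding()
}
#endif
```

The caller would still present the drawable and commit the command buffer afterwards, as your render() already does.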

Everywhere you're using size (or size(ofValue:)) from MemoryLayout, you should almost certainly be using stride (or stride(ofValue:)) instead. And if you're computing the stride of a compound data structure, do not take the stride of one element of that structure and multiply by the number of elements. Take the stride of the whole data structure.
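A minimal sketch of the difference, assuming a CPU-side vertex struct that mirrors your shader's VertexIn (PackedVertex and Padded are names I made up for illustration):

```swift
// Hypothetical CPU-side mirror of the shader's VertexIn: 3 position floats
// followed by 4 color floats, matching packed_float3 + packed_float4.
struct PackedVertex {
    var x: Float, y: Float, z: Float
    var r: Float, g: Float, b: Float, a: Float
}

// A type whose size and stride differ, to show why the distinction matters.
struct Padded {
    var value: Float   // 4 bytes
    var tag: UInt8     // 1 byte; array elements get 3 more bytes of padding
}

print(MemoryLayout<PackedVertex>.size, MemoryLayout<PackedVertex>.stride) // 28 28
print(MemoryLayout<Padded>.size, MemoryLayout<Padded>.stride)             // 5 8

// Buffer length for an array: stride of the whole element type times the count.
let white = PackedVertex(x: 0, y: 0, z: 0, r: 1, g: 1, b: 1, a: 1)
let vertices = [PackedVertex](repeating: white, count: 1_500)
let bufferLength = MemoryLayout<PackedVertex>.stride * vertices.count
print(bufferLength) // 42000
```

For a flat array of Float the two happen to agree, which is why your current code gets the right length, but as soon as the element type has padding, size undercounts the bytes the array actually occupies.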