Metal is much slower than OpenGL when rendering small textures onto a large texture

I am trying to migrate my projects from OpenGL to Metal on iOS. But I seem to have hit a performance wall. The task is simple...

I have a large texture (more than 3000x3000 pixels) onto which I need to draw several hundred small textures (say 124x124) on each touchesMoved event, with a particular blending function enabled. It is basically a paint brush. The large texture is then displayed. That is roughly the whole task.

On OpenGL it runs pretty fast: I get around 60fps. After porting the same code to Metal, I get only about 15fps.

I have created two sample projects with the bare minimum needed to demonstrate the problem. Here are the projects (both OpenGL and Metal)...

https://drive.google.com/file/d/12MPt1nMzE2UL_s4oXEUoTCXYiTz42r4b/view?usp=sharing

This is roughly what I do in OpenGL...

    - (void) renderBrush:(GLuint)brush on:(GLuint)fbo ofSize:(CGSize)size at:(CGPoint)point {
        GLfloat brushCoordinates[] = {
            0.0f, 0.0f,
            1.0f, 0.0f,
            0.0f,  1.0f,
            1.0f,  1.0f,
        };

        GLfloat imageVertices[] = {
            -1.0f, -1.0f,
            1.0f, -1.0f,
            -1.0f,  1.0f,
            1.0f,  1.0f,
        };

        int brushSize = 124;

        CGRect rect = CGRectMake(point.x - brushSize/2, point.y - brushSize/2, brushSize, brushSize);

        rect.origin.x /= size.width;
        rect.origin.y /= size.height;
        rect.size.width /= size.width;
        rect.size.height /= size.height;

        [self convertImageVertices:imageVertices toProjectionRect:rect onImageOfSize:size];

        int currentFBO;
        glGetIntegerv(GL_FRAMEBUFFER_BINDING, &currentFBO);

        [_Program use];

        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glViewport(0, 0, (int)size.width, (int)size.height);

        glActiveTexture(GL_TEXTURE2);
        glBindTexture(GL_TEXTURE_2D, brush);
        glUniform1i(brushTextureLocation, 2);

        glVertexAttribPointer(positionLocation, 2, GL_FLOAT, 0, 0, imageVertices);
        glVertexAttribPointer(brushCoordinateLocation, 2, GL_FLOAT, 0, 0, brushCoordinates);

        glEnable(GL_BLEND);
        glBlendEquation(GL_FUNC_ADD);
        glBlendFuncSeparate(GL_ONE, GL_ZERO, GL_ONE, GL_ONE);

        glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);

        glDisable(GL_BLEND);

        glActiveTexture(GL_TEXTURE2);
        glBindTexture(GL_TEXTURE_2D, 0);

        glBindFramebuffer(GL_FRAMEBUFFER, currentFBO);
    }

I run this code in a loop (about 200-500 iterations) per touch event. It runs pretty fast.

And this is how I have ported the code to Metal...

    - (void) renderBrush:(id<MTLTexture>)brush onTarget:(id<MTLTexture>)target at:(CGPoint)point withCommandBuffer:(id<MTLCommandBuffer>)commandBuffer {

        int brushSize = 124;

        CGRect rect = CGRectMake(point.x - brushSize/2, point.y - brushSize/2, brushSize, brushSize);

        rect.origin.x /= target.width;
        rect.origin.y /= target.height;
        rect.size.width /= target.width;
        rect.size.height /= target.height;

        Float32 imageVertices[8];
        // Calculate the vertices of the rectangle we are going to draw on the target texture.
        // We are not drawing on the entire target texture, only on a square around the point.
        [self composeImageVertices:imageVertices toProjectionRect:rect onImageOfSize:CGSizeMake(target.width, target.height)];

        // Use a separate vertexBuffer per pass: this runs in a loop, and subsequent calls would otherwise
        // overwrite the values. Other buffers also get overwritten, but that is OK for now; we only need
        // to demonstrate the performance.
        id<MTLBuffer> vertexBuffer = [_vertexArray lastObject];

        memcpy([vertexBuffer contents], imageVertices, 8 * sizeof(Float32));

        id<MTLRenderCommandEncoder> commandEncoder = [commandBuffer renderCommandEncoderWithDescriptor:mRenderPassDescriptor];
        commandEncoder.label = @"DrawCE";

        [commandEncoder setRenderPipelineState:mPipelineState];

        [commandEncoder setVertexBuffer:vertexBuffer offset:0 atIndex:0];
        [commandEncoder setVertexBuffer:mBrushTextureBuffer offset:0 atIndex:1];

        [commandEncoder setFragmentTexture:brush atIndex:0];
        [commandEncoder setFragmentSamplerState:mSampleState atIndex:0];

        [commandEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip vertexStart:0 vertexCount:4];
        [commandEncoder endEncoding];

    }

I then run this code in a loop, with a single MTLCommandBuffer per touch event, like this...

    id<MTLCommandBuffer> commandBuffer = [MetalContext.defaultContext.commandQueue commandBuffer];
    commandBuffer.label = @"DrawCB";

    dispatch_semaphore_wait(_inFlightSemaphore, DISPATCH_TIME_FOREVER);

    mRenderPassDescriptor.colorAttachments[0].texture = target;

    __block dispatch_semaphore_t block_sema = _inFlightSemaphore;
    [commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
        dispatch_semaphore_signal(block_sema);
    }];

    _vertexArray = [[NSMutableArray alloc] init];
    for (int i = 0; i < strokes; i++) {
        id<MTLBuffer> vertexBuffer = [MetalContext.defaultContext.device newBufferWithLength:8 * sizeof(Float32) options:0];
        [_vertexArray addObject:vertexBuffer];

        id<MTLTexture> brush = [_brushes objectAtIndex:rand()%_brushes.count];
        [self renderBrush:brush onTarget:target at:CGPointMake(x, y) withCommandBuffer:commandBuffer];
        x += deltaX;
        y += deltaY;
    }

    [commandBuffer commit];

In the sample code which I have attached, I have replaced the touch events with a timer loop to keep things simple.

On an iPhone 7 Plus, I get 60fps with OpenGL and 15fps with Metal. Maybe I am doing something horribly wrong here?

Solution

Remove all redundancy:

- Don't create buffers at render time. Allocate sufficient buffers during initialization.
- Don't create a command encoder for every quad.
- Use one big vertex buffer with different (properly aligned) offsets for each quad. Use -setVertexBufferOffset:atIndex: to set just the offset as necessary, without changing the buffer.
- composeImageVertices:... can write directly into the vertex buffer with an appropriate cast, avoiding a memcpy.
- Depending on what composeImageVertices:... actually does, and if deltaX and deltaY are constants, you may be able to set up the vertex buffer once, ever. The vertex shader can transform the vertices as necessary. You would pass in the appropriate data as uniforms (either the destination point and render target size, or even a transform matrix).
- Assuming they're the same every time, don't set mPipelineState, mBrushTextureBuffer, and mSampleState for every quad. Set them once per encoder.
- If any quads share the same brush texture, group them together and issue one draw command for all of them. This may require switching to triangle primitives instead of triangle strip primitives. However, if you do an indexed draw, you can use the primitive restart sentinel to draw multiple triangle strips in one draw command.
- You can even do multiple brushes in one draw command if the count doesn't exceed the number of textures allowed (31). Pass all of the brush textures to the fragment shader, which can receive them as an array of textures. The vertex data would include the brush index, the vertex shader would pass that forward, and the fragment shader would use it to pick the texture to sample from the array.
- You could use instanced drawing to draw everything in a single command. Draw one instance of a single quad per stroke (strokes instances in total). In the vertex shader, transform the position based on the instance ID. You would have to pass deltaX and deltaY in as uniform data. The brush indexes can be in a single buffer that's passed in, too, and the shader can look up the brush index in it by the instance ID.
- Have you considered using point primitives instead of quads? That would reduce the number of vertices and give Metal information that it can use to optimize rasterization.
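
As a rough illustration of the CPU side of the vertex-buffer and primitive-restart suggestions above, here is a minimal sketch in plain C, which compiles unchanged inside an Objective-C file. All function names (quad_offset, write_quad, fill_stroke, build_strip_indices) are hypothetical. The pixel-to-clip-space mapping in write_quad is only an assumption standing in for composeImageVertices:..., whose body is not shown in the question. The quads are packed tightly at 32-byte steps; if that does not satisfy the offset alignment your platform requires, pad each quad to a larger stride. 0xFFFF is the value Metal treats as the strip restart marker for 16-bit indices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { FLOATS_PER_QUAD = 8 };        /* 4 vertices x (x, y), triangle-strip order */
#define STRIP_RESTART_INDEX 0xFFFFu  /* restart sentinel for 16-bit indices */

/* Byte offset of quad i inside the shared, tightly packed vertex buffer. */
static size_t quad_offset(size_t i) {
    return i * FLOATS_PER_QUAD * sizeof(float);
}

/* Hypothetical stand-in for composeImageVertices:... (an assumption).
 * Writes the corners of a brush square centred on (cx, cy), given in
 * target pixels, straight into dst as clip-space coordinates in [-1, 1].
 * Corner order matches imageVertices in the question's OpenGL code. */
static void write_quad(float *dst, float cx, float cy, float brush_size,
                       float target_w, float target_h) {
    float half = brush_size / 2.0f;
    float x0 = (cx - half) / target_w * 2.0f - 1.0f;
    float x1 = (cx + half) / target_w * 2.0f - 1.0f;
    float y0 = (cy - half) / target_h * 2.0f - 1.0f;
    float y1 = (cy + half) / target_h * 2.0f - 1.0f;
    dst[0] = x0; dst[1] = y0;
    dst[2] = x1; dst[3] = y0;
    dst[4] = x0; dst[5] = y1;
    dst[6] = x1; dst[7] = y1;
}

/* Fill one pre-allocated buffer with `count` quads along a straight stroke.
 * Vertices are written in place, so no temporary array or memcpy is needed. */
static void fill_stroke(void *base, size_t count,
                        float x, float y, float dx, float dy,
                        float brush_size, float target_w, float target_h) {
    for (size_t i = 0; i < count; i++) {
        float *dst = (float *)((unsigned char *)base + quad_offset(i));
        write_quad(dst, x, y, brush_size, target_w, target_h);
        x += dx;
        y += dy;
    }
}

/* Build the index list 0 1 2 3 R 4 5 6 7 R ... so that `count` 4-vertex
 * strips can be drawn by one indexed draw. Returns the number of indices
 * written (5 * count - 1 when count > 0); `indices` must have room for them. */
static size_t build_strip_indices(uint16_t *indices, size_t count) {
    /* Every real vertex index must stay below the 16-bit sentinel. */
    assert(count * 4 <= STRIP_RESTART_INDEX);
    size_t n = 0;
    for (size_t i = 0; i < count; i++) {
        if (i > 0) {
            indices[n++] = STRIP_RESTART_INDEX;
        }
        for (size_t v = 0; v < 4; v++) {
            indices[n++] = (uint16_t)(i * 4 + v);
        }
    }
    return n;
}
```

In the app, base would be the contents pointer of a single MTLBuffer allocated once at initialization. If you still issue one draw per quad, quad_offset(i) is the value to hand to -setVertexBufferOffset:atIndex:. If you batch quads that share a brush, the index list is meant for a single drawIndexedPrimitives call with MTLIndexTypeUInt16.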