I am trying to migrate my projects from OpenGL to Metal on iOS, but I seem to have hit a performance wall. The task is simple...
I have a large texture (more than 3000x3000 pixels) onto which I need to draw a few hundred small textures (say 124x124) on each touchesMoved event, with a particular blending function enabled. It is basically a paint brush. The large texture is then displayed. That is roughly the whole task.
In OpenGL this runs fast, at around 60fps. When I port the same code to Metal, I only get 15fps.
I have created two bare-minimum sample projects (one OpenGL, one Metal) to demonstrate the problem. Here they are...
https://drive.google.com/file/d/12MPt1nMzE2UL_s4oXEUoTCXYiTz42r4b/view?usp=sharing
This is roughly what I do in OpenGL...
- (void) renderBrush:(GLuint)brush on:(GLuint)fbo ofSize:(CGSize)size at:(CGPoint)point {
GLfloat brushCoordinates[] = {
0.0f, 0.0f,
1.0f, 0.0f,
0.0f, 1.0f,
1.0f, 1.0f,
};
GLfloat imageVertices[] = {
-1.0f, -1.0f,
1.0f, -1.0f,
-1.0f, 1.0f,
1.0f, 1.0f,
};
int brushSize = 124;
CGRect rect = CGRectMake(point.x - brushSize/2, point.y - brushSize/2, brushSize, brushSize);
rect.origin.x /= size.width;
rect.origin.y /= size.height;
rect.size.width /= size.width;
rect.size.height /= size.height;
[self convertImageVertices:imageVertices toProjectionRect:rect onImageOfSize:size];
int currentFBO;
glGetIntegerv(GL_FRAMEBUFFER_BINDING, &currentFBO);
[_Program use];
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glViewport(0, 0, (int)size.width, (int)size.height);
glActiveTexture(GL_TEXTURE2);
glBindTexture(GL_TEXTURE_2D, brush);
glUniform1i(brushTextureLocation, 2);
glVertexAttribPointer(positionLocation, 2, GL_FLOAT, 0, 0, imageVertices);
glVertexAttribPointer(brushCoordinateLocation, 2, GL_FLOAT, 0, 0, brushCoordinates);
glEnable(GL_BLEND);
glBlendEquation(GL_FUNC_ADD);
glBlendFuncSeparate(GL_ONE, GL_ZERO, GL_ONE, GL_ONE);
glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);
glDisable(GL_BLEND);
glActiveTexture(GL_TEXTURE2);
glBindTexture(GL_TEXTURE_2D, 0);
glBindFramebuffer(GL_FRAMEBUFFER, currentFBO);
}
I run this code in a loop (about 200-500 iterations) per touch event. It runs pretty fast.
And this is how I have ported the code to Metal...
- (void) renderBrush:(id<MTLTexture>)brush onTarget:(id<MTLTexture>)target at:(CGPoint)point withCommandBuffer:(id<MTLCommandBuffer>)commandBuffer {
int brushSize = 124;
CGRect rect = CGRectMake(point.x - brushSize/2, point.y - brushSize/2, brushSize, brushSize);
rect.origin.x /= target.width;
rect.origin.y /= target.height;
rect.size.width /= target.width;
rect.size.height /= target.height;
Float32 imageVertices[8];
// Calculate the vertices (basically the rectangle that we need to draw) on the target texture that we are going to draw
// We are not drawing on the entire target texture, only on a square around the point
[self composeImageVertices:imageVertices toProjectionRect:rect onImageOfSize:CGSizeMake(target.width, target.height)];
// We use a different vertexBuffer for each pass, because this runs in a loop and subsequent calls would otherwise overwrite
// the values. Other buffers also get overwritten, but that is OK for now; we only need to demonstrate the performance.
id<MTLBuffer> vertexBuffer = [_vertexArray lastObject];
memcpy([vertexBuffer contents], imageVertices, 8 * sizeof(Float32));
id<MTLRenderCommandEncoder> commandEncoder = [commandBuffer renderCommandEncoderWithDescriptor:mRenderPassDescriptor];
commandEncoder.label = @"DrawCE";
[commandEncoder setRenderPipelineState:mPipelineState];
[commandEncoder setVertexBuffer:vertexBuffer offset:0 atIndex:0];
[commandEncoder setVertexBuffer:mBrushTextureBuffer offset:0 atIndex:1];
[commandEncoder setFragmentTexture:brush atIndex:0];
[commandEncoder setFragmentSamplerState:mSampleState atIndex:0];
[commandEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip vertexStart:0 vertexCount:4];
[commandEncoder endEncoding];
}
I then run this code in a loop, with a single MTLCommandBuffer per touch event, like this...
id<MTLCommandBuffer> commandBuffer = [MetalContext.defaultContext.commandQueue commandBuffer];
commandBuffer.label = @"DrawCB";
dispatch_semaphore_wait(_inFlightSemaphore, DISPATCH_TIME_FOREVER);
mRenderPassDescriptor.colorAttachments[0].texture = target;
__block dispatch_semaphore_t block_sema = _inFlightSemaphore;
[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
dispatch_semaphore_signal(block_sema);
}];
_vertexArray = [[NSMutableArray alloc] init];
for (int i = 0; i < strokes; i++) {
id<MTLBuffer> vertexBuffer = [MetalContext.defaultContext.device newBufferWithLength:8 * sizeof(Float32) options:0];
[_vertexArray addObject:vertexBuffer];
id<MTLTexture> brush = [_brushes objectAtIndex:rand()%_brushes.count];
[self renderBrush:brush onTarget:target at:CGPointMake(x, y) withCommandBuffer:commandBuffer];
x += deltaX;
y += deltaY;
}
[commandBuffer commit];
In the sample code which I have attached, I have replaced the touch events with a timer loop to keep things simple.
On an iPhone 7 Plus, I get 60fps with OpenGL and 15fps with Metal. Maybe I am doing something horribly wrong here?
Remove all redundancy:
- Use a single vertex buffer for all the quads, and use -setVertexBufferOffset:atIndex: to set just the offset as necessary, without changing the buffer.
- -composeImageVertices:... can write directly into the vertex buffer with an appropriate cast, avoiding a memcpy.
- Depending on what -composeImageVertices:... actually does and if deltaX and deltaY are constants, you may be able to set up the vertex buffer once, ever. The vertex shader can transform the vertices as necessary. You would pass in the appropriate data as uniforms (either the destination point and render target size, or even a transform matrix).
- Don't set mPipelineState, mBrushTextureBuffer, and mSampleState every time. Encoder state persists between draw calls, so this means issuing all the draws from one render command encoder rather than creating a new encoder per quad.
- Use instanced drawing: draw strokes instances of a single quad. In the vertex shader, transform the position based on the instance ID. You would have to pass deltaX and deltaY in as uniform data. The brush indexes can be in a single buffer that's passed in, too, and the shader can look up the brush index in it by the instance ID.
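As a rough illustration of the single-buffer and per-instance ideas above, here is a minimal sketch in plain C (which also compiles inside an Objective-C file). All names in it (Point2, instance_center, compose_quad, quad_byte_offset, fill_strokes) are hypothetical and not from the sample project. compose_quad is only a stand-in for -composeImageVertices:..., whose real mapping (for example its y-axis direction) may differ.

```c
#include <assert.h>
#include <stddef.h>

/* 4 vertices x (x, y), in triangle-strip order. */
enum { FLOATS_PER_QUAD = 8 };

typedef struct { float x, y; } Point2;

/* Centre of stroke number instance_id. This is the same arithmetic an
   instanced vertex shader could do from its instance ID, with start and
   delta (deltaX, deltaY) passed in as uniform data. */
static Point2 instance_center(size_t instance_id, Point2 start, Point2 delta)
{
    Point2 c = { start.x + delta.x * (float)instance_id,
                 start.y + delta.y * (float)instance_id };
    return c;
}

/* Assumed stand-in for -composeImageVertices:... : maps a brush-sized
   square centred on c (in pixels) to clip space [-1, 1], writing the
   result straight into out, so no intermediate array or memcpy is needed. */
static void compose_quad(float *out, Point2 c, float brush,
                         float target_w, float target_h)
{
    float half = brush / 2.0f;
    float x0 = (c.x - half) / target_w * 2.0f - 1.0f;
    float x1 = (c.x + half) / target_w * 2.0f - 1.0f;
    float y0 = (c.y - half) / target_h * 2.0f - 1.0f;
    float y1 = (c.y + half) / target_h * 2.0f - 1.0f;
    out[0] = x0; out[1] = y0;
    out[2] = x1; out[3] = y0;
    out[4] = x0; out[5] = y1;
    out[6] = x1; out[7] = y1;
}

/* Byte offset of quad i inside the shared buffer; this is the value that
   would be handed to -setVertexBufferOffset:atIndex: before draw i. */
static size_t quad_byte_offset(size_t i)
{
    return i * FLOATS_PER_QUAD * sizeof(float);
}

/* Fills one contiguous vertex array for all strokes. In the Metal code,
   vertices would be the cast contents pointer of a single MTLBuffer
   created once, with room for strokes * FLOATS_PER_QUAD floats. */
static void fill_strokes(float *vertices, size_t strokes,
                         Point2 start, Point2 delta, float brush,
                         float target_w, float target_h)
{
    assert(vertices != NULL);
    for (size_t i = 0; i < strokes; i++) {
        compose_quad(vertices + i * FLOATS_PER_QUAD,
                     instance_center(i, start, delta),
                     brush, target_w, target_h);
    }
}
```

With this layout, the per-quad work on the encoder shrinks to an offset change plus a draw call. If the draws are replaced by one instanced draw, the loop in fill_strokes goes away as well, and only the instance_center arithmetic remains, moved into the vertex shader.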