Memory Leak in a Simple Metal Program

I am trying to learn Metal for scientific programming. I tried creating a simple kernel that did morphological dilation. The issue that I am facing is that the memory seems to be increasing by a couple of KBs every time I call dilate on an image.

I verified the memory leak by running the dilate method in a for loop for 10000 iterations, and watched the allocated memory in Xcode's debug navigator grow from 16MB to 17 MBs.

Is there anything that you see in my code that would contribute to the memory leak? I have also pushed the project to Github in case that helps.

class MorphologyIOS : public Morphology
{
public:
    MorphologyIOS(
        const uint kernel,
        const uint width,
        const uint height
    ) {
        device_ = MTLCreateSystemDefaultDevice();
        kernelSize_ = kernel;
        buffer_ = [device_ newBufferWithBytes:&kernelSize_ length:4 options:MTLStorageModeShared];
        library_ = [device_ newDefaultLibrary];
        commandQueue_ = [device_ newCommandQueue];
        identityFunction_ = [library_ newFunctionWithName:@"identity"];

        MTLTextureDescriptor* readDesc = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormat::MTLPixelFormatR8Uint
            width:width height:height mipmapped:false];

        MTLTextureDescriptor* writeDesc = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormat::MTLPixelFormatR8Uint
            width:width height:height mipmapped:false];

        [writeDesc setUsage:MTLTextureUsageShaderWrite];

        inTexture_ = [device_ newTextureWithDescriptor:readDesc];
        outTexture_ = [device_ newTextureWithDescriptor:writeDesc];

        entireImage_ = MTLRegionMake2D(0, 0, width, height);

        pipelineState_ = [device_ newComputePipelineStateWithFunction:identityFunction_ error:NULL];

    }

    virtual ~MorphologyIOS() override {}

    virtual std::shared_ptr<unsigned char> dilate(
        const std::shared_ptr<unsigned char>& inImage
    ) override {
        void* result = malloc(outTexture_.width * outTexture_.height);
        std::shared_ptr<unsigned char> outImage;
        @autoreleasepool
        {
            commandBuffer_ = [commandQueue_ commandBuffer];
            commandEncoder_ = [commandBuffer_ computeCommandEncoder];
            [commandEncoder_ setComputePipelineState:pipelineState_];

            [inTexture_ replaceRegion:entireImage_ mipmapLevel:0 withBytes:inImage.get() bytesPerRow:outTexture_.width];

            [commandEncoder_ setTexture:inTexture_ atIndex:0];
            [commandEncoder_ setTexture:outTexture_ atIndex:1];
            [commandEncoder_ setBuffer:buffer_ offset:0 atIndex:0];

            MTLSize threadGroupCount = MTLSizeMake(10, 10, 1);
            MTLSize threadGroups = MTLSizeMake(inTexture_.width / threadGroupCount.width,
                inTexture_.height / threadGroupCount.height, 1);

            [commandEncoder_ dispatchThreadgroups:threadGroups threadsPerThreadgroup:threadGroupCount];
            [commandEncoder_ endEncoding];
            [commandBuffer_ commit];
            [commandBuffer_ waitUntilCompleted];

            [outTexture_ getBytes:result bytesPerRow:outTexture_.width fromRegion:entireImage_ mipmapLevel:0];
            outImage.reset(reinterpret_cast<unsigned char*>(result));
        }

        return outImage;
    }
private:
    id<MTLDevice> device_;
    uint kernelSize_;
    id<MTLBuffer> buffer_;
    id<MTLLibrary> library_;
    id<MTLComputePipelineState> pipelineState_;
    id<MTLCommandQueue> commandQueue_;
    id<MTLFunction> identityFunction_;
    id<MTLCommandBuffer> commandBuffer_;
    id<MTLComputeCommandEncoder> commandEncoder_;
    id<MTLTexture> inTexture_;
    id<MTLTexture> outTexture_;
    MTLRegion entireImage_;
};

And my kernel looks like this:

kernel void dilation(
    texture2d<uint, access::read> inTexture [[texture(0)]],
    texture2d<uint, access::write> outTexture [[texture(1)]],
    device uint *kernelSize [[buffer(0)]],
    uint2 gid [[thread_position_in_grid]]
) {
    uint halfKernel = kernelSize[0] / 2;
    uint minX = gid.x >= halfKernel ? gid.x - halfKernel : 0;
    uint minY = gid.y >= halfKernel ? gid.y - halfKernel : 0;
    uint maxX = gid.x + halfKernel < inTexture.get_width() ? gid.x + halfKernel : inTexture.get_width();
    uint maxY = gid.y + halfKernel < inTexture.get_height() ? gid.y + halfKernel : inTexture.get_height();
    uint maxValue = 0;
    for (uint i = minX; i <= maxX; i++)
    {
        for (uint j = minY; j <= maxY; j++)
        {
            uint4 value = inTexture.read(uint2(i, j));
            if (maxValue < value[0])
                maxValue = value[0];
        }
    }
    outTexture.write(maxValue, gid);
}

Solution

This isn't so much a bug as it is an artifact of the capture/validation layer doing some bookkeeping on your behalf. Since it won't occur in real-world usage, it's probably not something to worry about.