Emulating Apple's drawInRect: for offscreen pixel buffers

I need a routine that quickly copies raw 32-bit pixel data (held in malloc-ed buffers) from a rectangular region of one buffer to a rectangular region of another.

So... below is my attempt to emulate Apple's drawInRect:fromRect:operation:fraction: method for blitting data out to an NSView. Methods like this typically live in the NSImage or NSBitmapImageRep classes. I'm ignoring the operation: modes and the fraction: alpha blending.

One can assume that the x/y/w/h values have already been checked and clipped so that the source and target rectangles lie within the two buffers provided, and that the rectangular regions are non-empty and the same size (i.e. no scaling).

My tests indicate that copying a full HD (1920x1080) region on my specific hardware takes:

case 1: 32-bit transfers: 6.74ms
case 2: 64-bit transfers: 5.30ms
case 3: memcpy transfers: 3.20ms

Unfortunately, as some of these buffers are provided by an external API, I have no assurance that the buffers are 64- or 128-bit aligned. That said, I have a hunch they are aligned in my case, and that memcpy checks the buffer address for that alignment and then uses SSE3 instructions to do its business (_platform_memmove$VARIANT$Ivybridge).

Are there any suggestions for improving this?

Or maybe there is some magical routine in the Cocoa API that does this already?

typedef struct copyRect
{
    u_int32_t   *data;
    u_int32_t   x;
    u_int32_t   y;
    u_int32_t   w;
    u_int32_t   h;
    u_int32_t   canvasWidth;  // ie. rowBytes/4
} copyRect;

-(void)copyRectFromSrc:(copyRect *)srcImage toTarget:(copyRect *)dstImage
{
    u_int32_t h = srcImage->h;
    u_int32_t w = srcImage->w;

    u_int32_t srcDelta = srcImage->y*srcImage->canvasWidth + srcImage->x;
    u_int32_t dstDelta = dstImage->y*dstImage->canvasWidth + dstImage->x;
    u_int32_t *srcPtr = srcImage->data+srcDelta;
    u_int32_t *dstPtr = dstImage->data+dstDelta;
    u_int32_t w2 = w/2;

    // scan top-to-bottom in buffer
    for (u_int32_t y=0; y<h; y++) {

// case 1: this would work in all cases (single pixel = 32 bits)
//        u_int32_t *srcXptr = srcPtr;
//        u_int32_t *dstXptr = dstPtr;
//        for (u_int32_t x=0; x<w; x++)
//            *dstXptr++ = *srcXptr++;

// case 2: this works only if w is even (and it assumes each row start is 8-byte aligned)
//        u_int64_t *srcXptr = (u_int64_t *)srcPtr;
//        u_int64_t *dstXptr = (u_int64_t *)dstPtr;
//        for (u_int32_t x=0; x<w2; x++)
//            *dstXptr++ = *srcXptr++;

// case 3: this seems to have the best performance (all cases)
        memcpy(dstPtr,srcPtr,w*4);

        srcPtr += srcImage->canvasWidth;
        dstPtr += dstImage->canvasWidth;
    }
}

Solution

#include <Accelerate/Accelerate.h>  // see vImage/Conversion.h

vImage_Error vImageCopyBuffer(const vImage_Buffer *src, const vImage_Buffer *dest, size_t pixelSize, vImage_Flags flags ) VIMAGE_NON_NULL(1,2) __OSX_AVAILABLE_STARTING(__MAC_10_10, __IPHONE_8_0);
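
As a rough sketch of how this might be wired up to the copyRect layout from the question: each rectangle is described by a vImage_Buffer whose data pointer is offset to the rectangle's first pixel and whose rowBytes is the full canvas stride. The PixelRect struct and copy_pixel_rect function are hypothetical names for illustration, and the non-Apple branch is only a plain memcpy fallback so the sketch compiles where Accelerate is unavailable. It assumes, as the question does, that both rectangles are already clipped and the same size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__APPLE__)
#include <Accelerate/Accelerate.h>
#endif

/* Hypothetical mirror of the question's copyRect struct. */
typedef struct {
    uint32_t *data;         /* first pixel of the whole canvas */
    uint32_t  x, y, w, h;   /* rectangle, in pixels */
    uint32_t  canvasWidth;  /* canvas stride in pixels (rowBytes / 4) */
} PixelRect;

/* Copies src's rectangle into dst's rectangle; returns 0 on success.
   Assumes both rectangles are pre-clipped and have the same w and h. */
static int copy_pixel_rect(const PixelRect *src, const PixelRect *dst)
{
    /* size_t arithmetic avoids 32-bit overflow on very large canvases. */
    uint32_t *s = src->data + (size_t)src->y * src->canvasWidth + src->x;
    uint32_t *d = dst->data + (size_t)dst->y * dst->canvasWidth + dst->x;
    size_t srcRowBytes = (size_t)src->canvasWidth * sizeof(uint32_t);
    size_t dstRowBytes = (size_t)dst->canvasWidth * sizeof(uint32_t);

#if defined(__APPLE__)
    /* Each buffer describes only the sub-rectangle; rowBytes keeps the
       stride of the full canvas. */
    vImage_Buffer sb = { .data = s, .height = src->h, .width = src->w,
                         .rowBytes = srcRowBytes };
    vImage_Buffer db = { .data = d, .height = src->h, .width = src->w,
                         .rowBytes = dstRowBytes };
    /* pixelSize is in bytes: 4 for 32-bit pixels. */
    return vImageCopyBuffer(&sb, &db, sizeof(uint32_t), kvImageNoFlags)
               == kvImageNoError ? 0 : -1;
#else
    /* Fallback: one memcpy per row, as in case 3 of the question. */
    for (uint32_t row = 0; row < src->h; row++) {
        memcpy((uint8_t *)d + row * dstRowBytes,
               (const uint8_t *)s + row * srcRowBytes,
               (size_t)src->w * sizeof(uint32_t));
    }
    return 0;
#endif
}
```

On macOS this needs to be linked with -framework Accelerate. Whether it beats the per-row memcpy on a given machine is something to measure rather than assume.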

For alpha compositing, see the various alpha-blend routines in vImage/Alpha.h.