I need a routine that will quickly copy raw 32-bit pixel data (malloc-ed) from a rectangular region of one buffer to a rectangular region of another.
So... below is my attempt to emulate Apple's drawInRect:fromRect:operation:fraction method for blitting data out to an NSView. These two routines typically exist in either the NSImage or NSBitmapImageRep classes. I'm ignoring operation: modes or fraction: alpha blending.
One can assume that the x/y/w/h values have been tested and truncated to ensure that the source/target rectangles lie within the two buffers provided, and that the rectangular regions are non-zero and the same size (i.e. no scaling).
My tests indicate copying a full HD (1920x1080) region image on my specific hardware is
Unfortunately, as some of these buffers are provided by an external API, I have no assurances that the buffers are 64- or 128-bit aligned. Having said that, I have a hunch they are in my case, and that memcpy is testing to see if the buffer address has said alignment and is doing some SSE3 intrinsics to do its business (_platform_memmove$VARIANT$Ivybridge).
Are there any suggestions for improving this?
Or maybe there is some magical routine in the Cocoa API that does this already?
typedef struct copyRect
{
    u_int32_t *data;
    u_int32_t x;
    u_int32_t y;
    u_int32_t w;
    u_int32_t h;
    u_int32_t canvasWidth; // ie. rowBytes/4
} copyRect;
-(void)copyRectFromSrc:(copyRect *)srcImage toTarget:(copyRect *)dstImage
{
    u_int32_t h = srcImage->h;
    u_int32_t w = srcImage->w;
    u_int32_t srcDelta = srcImage->y*srcImage->canvasWidth + srcImage->x;
    u_int32_t dstDelta = dstImage->y*dstImage->canvasWidth + dstImage->x;
    u_int32_t *srcPtr = srcImage->data+srcDelta;
    u_int32_t *dstPtr = dstImage->data+dstDelta;
    u_int32_t w2 = w/2; // only used by the commented-out case 2

    // scan top-to-bottom in buffer
    for (u_int32_t y=0; y<h; y++) {
        // case 1: this would work in all cases (single pixel = 32 bits)
        // u_int32_t *srcXptr = srcPtr;
        // u_int32_t *dstXptr = dstPtr;
        // for (u_int32_t x=0; x<w; x++)
        //     *dstXptr++ = *srcXptr++;

        // case 2: this would work if src/dst image were even-width
        // u_int64_t *srcXptr = (u_int64_t *)srcPtr;
        // u_int64_t *dstXptr = (u_int64_t *)dstPtr;
        // for (u_int32_t x=0; x<w2; x++)
        //     *dstXptr++ = *srcXptr++;

        // case 3: this seems to have the best performance (all cases)
        memcpy(dstPtr,srcPtr,w*4);

        // advance both pointers by one full canvas row
        srcPtr += srcImage->canvasWidth;
        dstPtr += dstImage->canvasWidth;
    }
}
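One possible tweak to the row-by-row memcpy approach, sketched in plain C: when the rectangle spans the full width of both canvases, the rows sit back to back in memory, so the whole region can go out in a single memcpy call. The names PixelRect and copy_rect are hypothetical stand-ins for the copyRect struct and method above, and the sketch assumes the two buffers do not overlap. Whether it actually helps on any given hardware would need measuring.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical plain-C mirror of the copyRect struct above. */
typedef struct {
    uint32_t *data;
    uint32_t x, y, w, h;
    uint32_t canvasWidth;   /* pixels per row, i.e. rowBytes/4 */
} PixelRect;

/* Copies the src rectangle into the dst rectangle.
   Assumes both rectangles are pre-clipped, the same size,
   and that the two buffers do not overlap. */
static void copy_rect(const PixelRect *src, const PixelRect *dst)
{
    size_t w = src->w;
    size_t h = src->h;
    /* size_t offsets avoid 32-bit overflow on very large canvases */
    const uint32_t *s = src->data + (size_t)src->y * src->canvasWidth + src->x;
    uint32_t *d = dst->data + (size_t)dst->y * dst->canvasWidth + dst->x;
    size_t rowBytes = w * sizeof(uint32_t);

    /* Fast path: full-width rows are contiguous in both buffers. */
    if (w == src->canvasWidth && w == dst->canvasWidth) {
        memcpy(d, s, rowBytes * h);
        return;
    }

    /* General path: one memcpy per row, stepping by each canvas width. */
    for (size_t row = 0; row < h; row++) {
        memcpy(d, s, rowBytes);
        s += src->canvasWidth;
        d += dst->canvasWidth;
    }
}
```

For sub-rectangles the behaviour is the same as case 3 above; only the full-width case changes.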
#include <Accelerate/Accelerate.h> // see vImage/Conversion.h
vImage_Error vImageCopyBuffer(const vImage_Buffer *src, const vImage_Buffer *dest, size_t pixelSize, vImage_Flags flags ) VIMAGE_NON_NULL(1,2) __OSX_AVAILABLE_STARTING(__MAC_10_10, __IPHONE_8_0);
For Alpha compositing, see the various alpha blend routines in vImage/Alpha.h.
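A rough sketch of how vImageCopyBuffer might be called for the rectangle copy in the question. The helper name blit32 and its parameters are hypothetical; the caller is assumed to pass pointers already offset to the top-left pixel of each rectangle, with strides given in pixels (canvasWidth in the question). The non-Apple branch is only a memcpy fallback so the sketch compiles elsewhere; it is not part of vImage.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#ifdef __APPLE__
#include <Accelerate/Accelerate.h>
#endif

/* Hypothetical helper: copies a w x h block of 32-bit pixels.
   Returns 0 on success, -1 if vImage reports an error. */
static int blit32(const uint32_t *src, size_t srcStride,
                  uint32_t *dst, size_t dstStride,
                  size_t w, size_t h)
{
#ifdef __APPLE__
    /* vImage describes each region by base pointer, size and rowBytes. */
    vImage_Buffer s = { .data = (void *)src, .height = h, .width = w,
                        .rowBytes = srcStride * sizeof(uint32_t) };
    vImage_Buffer d = { .data = dst, .height = h, .width = w,
                        .rowBytes = dstStride * sizeof(uint32_t) };
    vImage_Error err = vImageCopyBuffer(&s, &d, sizeof(uint32_t), kvImageNoFlags);
    return err == kvImageNoError ? 0 : -1;
#else
    /* Fallback for non-Apple builds: plain row-by-row memcpy. */
    for (size_t row = 0; row < h; row++) {
        memcpy(dst + row * dstStride, src + row * srcStride,
               w * sizeof(uint32_t));
    }
    return 0;
#endif
}
```

On macOS this needs to be linked with -framework Accelerate, and vImageCopyBuffer is only available from OS X 10.10 / iOS 8.0 per the declaration above.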