I have image buffers of an arbitrary size that I copy into equal-sized or larger buffers at an x,y offset. The colorspace is BGRA. My current copy method is:
    void render(guint8* src, guint8* dest, uint src_width, uint src_height, uint dest_x, uint dest_y, uint dest_buffer_width) {
        bool use_single_memcpy = (dest_x == 0) && (dest_y == 0) && (dest_buffer_width == src_width);
        if(use_single_memcpy) {
            memcpy(dest, src, src_width * src_height * 4);
        }
        else {
            dest += (dest_y * dest_buffer_width * 4);
            for(uint i=0;i < src_height;i++) {
                memcpy(dest + (dest_x * 4), src, src_width * 4);
                dest += dest_buffer_width * 4;
                src += src_width * 4;
            }
        }
    }
It runs fast but I was curious if there was anything I could do to improve it and gain a few extra milliseconds. If it involves going to assembly code I'd prefer to avoid that, but I'm willing to add additional libraries.
Your use_single_memcpy test is too restrictive. A slight rearrangement allows you to remove the dest_y == 0 requirement.
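    void render(guint8* src, guint8* dest,
                uint src_width, uint src_height,
                uint dest_x, uint dest_y,
                uint dest_buffer_width)
    {
        /* Rows are contiguous in both buffers when the widths match and dest_x is 0. */
        bool use_single_memcpy = (dest_x == 0) && (dest_buffer_width == src_width);
        /* Convert both widths from pixels to bytes (4 bytes per BGRA pixel). */
        dest_buffer_width <<= 2;
        src_width <<= 2;
        /* Skip down to the first destination row. */
        dest += (dest_y * dest_buffer_width);
        if(use_single_memcpy) {
            memcpy(dest, src, src_width * src_height);
        }
        else {
            /* Skip to the first destination column, then copy row by row. */
            dest += (dest_x << 2);
            while (src_height--) {
                memcpy(dest, src, src_width);
                dest += dest_buffer_width;
                src += src_width;
            }
        }
    }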
I've also changed the loop to a countdown (which may be more efficient), removed a useless temporary variable, and hoisted the repeated calculations out of the loop.
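To see why dest_y no longer matters for the single-copy case, here is a rough sketch that compares a plain row-by-row copy against the rearranged version. It is not taken from the code above verbatim: it uses uint8_t and unsigned in place of guint8 and uint so that it builds without GLib, and render_rows, render_fast and renders_agree are names made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BYTES_PER_PIXEL 4u /* BGRA */

/* Reference version: always copies one row at a time. */
void render_rows(const uint8_t *src, uint8_t *dest,
                 unsigned src_width, unsigned src_height,
                 unsigned dest_x, unsigned dest_y,
                 unsigned dest_buffer_width)
{
    size_t src_stride = (size_t)src_width * BYTES_PER_PIXEL;
    size_t dest_stride = (size_t)dest_buffer_width * BYTES_PER_PIXEL;
    dest += dest_y * dest_stride + (size_t)dest_x * BYTES_PER_PIXEL;
    for (unsigned row = 0; row < src_height; row++) {
        memcpy(dest, src, src_stride);
        dest += dest_stride;
        src += src_stride;
    }
}

/* Rearranged version: one memcpy whenever the rows are contiguous in
   both buffers (dest_x == 0 and equal widths), whatever dest_y is. */
void render_fast(const uint8_t *src, uint8_t *dest,
                 unsigned src_width, unsigned src_height,
                 unsigned dest_x, unsigned dest_y,
                 unsigned dest_buffer_width)
{
    size_t src_stride = (size_t)src_width * BYTES_PER_PIXEL;
    size_t dest_stride = (size_t)dest_buffer_width * BYTES_PER_PIXEL;
    dest += dest_y * dest_stride;
    if (dest_x == 0 && dest_buffer_width == src_width) {
        memcpy(dest, src, src_stride * src_height);
    } else {
        dest += (size_t)dest_x * BYTES_PER_PIXEL;
        for (unsigned row = 0; row < src_height; row++) {
            memcpy(dest, src, src_stride);
            dest += dest_stride;
            src += src_stride;
        }
    }
}

/* Hypothetical check: returns 1 if both versions produce identical
   destination buffers for the given geometry, 0 otherwise. */
int renders_agree(unsigned src_width, unsigned src_height,
                  unsigned dest_x, unsigned dest_y,
                  unsigned dest_width, unsigned dest_height)
{
    /* The source rectangle must fit inside the destination. */
    assert(dest_x + src_width <= dest_width);
    assert(dest_y + src_height <= dest_height);

    size_t src_size = (size_t)src_width * src_height * BYTES_PER_PIXEL;
    size_t dest_size = (size_t)dest_width * dest_height * BYTES_PER_PIXEL;
    uint8_t *src = malloc(src_size);
    uint8_t *a = calloc(dest_size, 1);
    uint8_t *b = calloc(dest_size, 1);
    int same = 0;

    if (src && a && b) {
        /* Fill the source with an arbitrary non-constant pattern. */
        for (size_t i = 0; i < src_size; i++)
            src[i] = (uint8_t)(i * 31u + 7u);
        render_rows(src, a, src_width, src_height, dest_x, dest_y, dest_width);
        render_fast(src, b, src_width, src_height, dest_x, dest_y, dest_width);
        same = memcmp(a, b, dest_size) == 0;
    }
    free(src);
    free(a);
    free(b);
    return same;
}
```

Calling renders_agree(4, 3, 0, 2, 4, 8) exercises the single-memcpy path with a non-zero dest_y, and renders_agree(3, 2, 1, 1, 6, 5) exercises the row-by-row path.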
It's likely that you can do even better using SSE intrinsics to copy 16 bytes (four BGRA pixels) at a time instead of 4, but then you'll need to worry about alignment and about row widths that aren't a multiple of 4 pixels. A good memcpy implementation should already do these things.
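For illustration only, this is roughly what a 16-bytes-at-a-time row copy could look like with SSE2 intrinsics. It is a sketch under assumptions, not a measured improvement: copy_row_sse2 and copy_row_matches are hypothetical names, the unaligned load/store intrinsics are used to sidestep the alignment question, the leftover bytes of a row go through memcpy, and on targets without SSE2 the whole row falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical helper: copy one row of n_bytes from src to dest. */
void copy_row_sse2(uint8_t *dest, const uint8_t *src, size_t n_bytes)
{
#if HAVE_SSE2
    size_t i = 0;
    /* 16 bytes = 4 BGRA pixels per iteration; loadu/storeu accept
       pointers that are not 16-byte aligned. */
    for (; i + 16 <= n_bytes; i += 16) {
        __m128i pixels = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dest + i), pixels);
    }
    /* Tail: rows whose width is not a multiple of 4 pixels. */
    memcpy(dest + i, src + i, n_bytes - i);
#else
    /* Assumption: no SSE2 available, so defer to memcpy entirely. */
    memcpy(dest, src, n_bytes);
#endif
}

/* Hypothetical check: copies n_pixels BGRA pixels starting at a byte
   offset (to force misalignment) and returns 1 if the copied bytes
   match and the byte just past the row was left untouched. */
int copy_row_matches(size_t n_pixels, size_t offset)
{
    uint8_t src[256];
    uint8_t dest[256];
    size_t n_bytes = n_pixels * 4;

    assert(offset + n_bytes < sizeof dest);

    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (uint8_t)(i * 13u + 5u);
    memset(dest, 0xAA, sizeof dest);

    copy_row_sse2(dest + offset, src + offset, n_bytes);

    return memcmp(dest + offset, src + offset, n_bytes) == 0
        && dest[offset + n_bytes] == 0xAA;
}
```

In render, the memcpy inside the loop would be the call to swap for copy_row_sse2(dest, src, src_width). Whether that beats the platform memcpy is something to measure rather than assume.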