fastest way to blit image buffer into an xy offset of another buffer in C++ on amd64 architecture

I have image buffers of an arbitrary size that I copy into equal-sized or larger buffers at an x,y offset. The colorspace is BGRA. My current copy method is:

void render(guint8* src, guint8* dest, uint src_width, uint src_height, uint dest_x, uint dest_y, uint dest_buffer_width) {
    bool use_single_memcpy = (dest_x == 0) && (dest_y == 0) && (dest_buffer_width == src_width);

    if(use_single_memcpy) {
        memcpy(dest, src, src_width * src_height * 4);
    }
    else {
        dest += (dest_y * dest_buffer_width * 4);
        for(uint i=0;i < src_height;i++) {
            memcpy(dest + (dest_x * 4), src, src_width * 4);
            dest += dest_buffer_width * 4;
            src += src_width * 4;
        }
    }
}

It runs fast but I was curious if there was anything I could do to improve it and gain a few extra milliseconds. If it involves going to assembly code I'd prefer to avoid that, but I'm willing to add additional libraries.

Solution

Your use_single_memcpy test is too restrictive. A slight rearrangement allows you to remove the dest_y == 0 requirement.

void render(guint8* src, guint8* dest,
            uint src_width, uint src_height, 
            uint dest_x, uint dest_y,
            uint dest_buffer_width)
{
    bool use_single_memcpy = (dest_x == 0) && (dest_buffer_width == src_width);
    dest_buffer_width <<= 2;
    src_width <<= 2;
    dest += (dest_y * dest_buffer_width);

    if(use_single_memcpy) {
        memcpy(dest, src, src_width * src_height);
    }
    else {
        dest += (dest_x << 2);
        while (src_height--) {
            memcpy(dest, src, src_width);
            dest += dest_buffer_width;
            src += src_width;
        }
    }
}

I've also changed the loop to a countdown (which may be more efficient) and removed a useless temporary variable, and lifted repeated calculations.

It's likely that you can do even better using SSE intrinsics to copy 16 bytes at a time instead of 4, but then you'll need to worry about alignment and multiples of 4 pixels. A good memcpy implementation should already do these things.