Tags: c++, x86-64, memcpy, blit, libdispatch

Fastest way to blit an image buffer into an x,y offset of another buffer in C++ on amd64 architecture


I have image buffers of arbitrary size that I copy into equal-sized or larger buffers at an x,y offset. The pixel format is BGRA (4 bytes per pixel). My current copy method is:

void render(guint8* src, guint8* dest, uint src_width, uint src_height, uint dest_x, uint dest_y, uint dest_buffer_width) {
    bool use_single_memcpy = (dest_x == 0) && (dest_y == 0) && (dest_buffer_width == src_width);

    if(use_single_memcpy) {
        memcpy(dest, src, src_width * src_height * 4);
    }
    else {
        dest += (dest_y * dest_buffer_width * 4);
        for(uint i=0;i < src_height;i++) {
            memcpy(dest + (dest_x * 4), src, src_width * 4);
            dest += dest_buffer_width * 4;
            src += src_width * 4;
        }
    }
}

It runs fast, but I'm curious whether there is anything I could do to improve it and gain a few extra milliseconds. I'd prefer to avoid assembly code, but I'm willing to add additional libraries.


Solution

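  • Your use_single_memcpy test is too restrictive. Whenever the source spans the full width of the destination, the copied rows are contiguous in both buffers no matter which row the copy starts on, so a slight rearrangement allows you to remove the dest_y == 0 requirement.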

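    void render(guint8* src, guint8* dest,
                uint src_width, uint src_height,
                uint dest_x, uint dest_y,
                uint dest_buffer_width)
    {
        // Rows are contiguous in both buffers only when the source spans the
        // full destination width; dest_y then just moves the starting row.
        bool use_single_memcpy = (dest_x == 0) && (dest_buffer_width == src_width);
        // Convert both widths from pixels to bytes (4 bytes per BGRA pixel).
        dest_buffer_width <<= 2;
        src_width <<= 2;
        // Skip to the first destination row.
        dest += (dest_y * dest_buffer_width);

        if(use_single_memcpy) {
            memcpy(dest, src, src_width * src_height);
        }
        else {
            dest += (dest_x << 2);  // byte offset of the first destination column
            while (src_height--) {
                memcpy(dest, src, src_width);  // copy one source row
                dest += dest_buffer_width;
                src += src_width;
            }
        }
    }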
    

    I've also changed the loop to a countdown (which may be more efficient), removed a useless temporary variable, and hoisted the repeated calculations out of the loop.

    You may be able to do even better using SSE intrinsics to copy 16 bytes (4 pixels) at a time, but then you'll need to worry about alignment and about row widths that aren't multiples of 4 pixels. A good memcpy implementation should already do these things.
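
    To illustrate the SSE idea, here is a minimal sketch rather than a tuned implementation. The names copy_row_sse2 and blit_sse2 are made up for this example, and it uses plain standard types instead of guint8/uint. It sidesteps the alignment problem by using unaligned loads and stores, and it hands the leftover bytes of each row (fewer than 4 pixels) to memcpy. Whether it actually beats a plain memcpy per row is something only a benchmark on your machine can tell you.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (SSE2 is baseline on x86-64)
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy `bytes` bytes of one row, 16 bytes (4 BGRA
// pixels) per iteration. Unaligned loads/stores are used so neither
// pointer has to be 16-byte aligned.
static void copy_row_sse2(const std::uint8_t* src, std::uint8_t* dest,
                          std::size_t bytes)
{
    std::size_t i = 0;
    for (; i + 16 <= bytes; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dest + i), v);
    }
    // Tail: fewer than 16 bytes remain when the row width is not a
    // multiple of 4 pixels.
    std::memcpy(dest + i, src + i, bytes - i);
}

// Hypothetical blit: same parameters as render(), all measured in pixels.
static void blit_sse2(const std::uint8_t* src, std::uint8_t* dest,
                      std::size_t src_width, std::size_t src_height,
                      std::size_t dest_x, std::size_t dest_y,
                      std::size_t dest_buffer_width)
{
    // The source rectangle must fit horizontally inside the destination.
    assert(dest_x + src_width <= dest_buffer_width);

    const std::size_t src_stride  = src_width * 4;          // bytes per source row
    const std::size_t dest_stride = dest_buffer_width * 4;  // bytes per destination row

    dest += dest_y * dest_stride + dest_x * 4;  // top-left byte of the target rectangle
    for (std::size_t row = 0; row < src_height; ++row) {
        copy_row_sse2(src, dest, src_stride);
        src  += src_stride;
        dest += dest_stride;
    }
}
```

    Calling blit_sse2(src, dest, src_width, src_height, dest_x, dest_y, dest_buffer_width) should produce the same bytes as render() for the same arguments. If you know both buffers and both strides are 16-byte aligned, the aligned variants _mm_load_si128 and _mm_store_si128 could be swapped in, but that is an extra assumption this sketch does not make.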