Search code examples
cperformancealphablending

Fast Alpha Blending (CPU only)


I'm currently optimizing my manual (CPU driven) alpha blending (pixel b over pixel a according to the alpha value of pixel b).

I'm currently using:

uint8_t invAlpha = !Alpha;
uint8_t R = (Alpha * r_src + invAlpha * r_dst) >> 8;
uint8_t G = (Alpha * g_src + invAlpha * g_dst) >> 8;
uint8_t B = (Alpha * b_src + invAlpha * b_dst) >> 8;
*pDstPix = (255 << 24) | (R << 16) | (G << 8) | B;

Now I have found the paper "Alpha Blending with No Division Operations" from: https://arxiv.org/pdf/2202.02864

A passage in this paper explains that you can have several 8-bit components alpha blended in parallel in the same register.

Unfortunately, in the example there, only blending is carried out with a pixel's own alpha value. (I don't even know why that makes sense...)

No matter, I just can't figure out whether, and especially how, to base my algorithm (of the three individual RGB blendings) on this. How can my Algorithem be accelerated?

(min. 80386 / ARM Cortex-M3 or higher, no assembler (the compiler should optimize itself for all supported platforms.)


Solution

  • You can't speed up your code by much as it is. Most likely the compiler can vectorize this (if your CPU has SIMD) and the bottleneck will be waiting for memory.

    Your implementation already uses the simpler approximation of dividing by 256 (>> 8) and rounding towards zero. The paper is about speeding up the more precise variant, which divides by 255 and rounds to the nearest value.

    Precise Variant

    I'm assuming you mean invAlpha = 255 - Alpha. When you blend a full white with full opacity (255 for the red channel) then your code calculates (255 * 255 + 0 * r_dst) >> 8 which is floor((255*255)/256) = 254.

    You could implement the precise version like this:

    uint8_t R = (Alpha * r_src + (255-Alpha) * r_dst + (255/2)) / 255;
    

    This is not as bad as it looks. Division instructions are very slow, but modern compilers can usually avoid them when dividing by a constant.

    Checking the gdk-pixbuf implementation is probably a better idea, they do rounding like this:

    uint8_t a1 = 0xff - a0;
    uint16_t tmp = a0 * r_src + a1 * r_dst + 0x80;
    uint8_t R = (tmp + (tmp >> 8)) >> 8;
    

    Variant: Pre-Multiplied Alpha

    Unfortunately, in the example there, only blending is carried out with a pixel's own alpha value. (I don't even know why that makes sense...)

    It makes sense when you implement something like layers in GIMP, or a browser rendering engine even, that blends two layers together while the output still has an alpha channel. If the result doesn't have an alpha channel (but the source has), then you may be able to speed blending up slightly by storing your data in pre-multiplied alpha format.

    For example if you render a game character over a background image, your sprite uses an alpha channel to smooth out the edges. Most of the time you don't want the character to be translucent, and you render like this:

    uint8_t R = (a_src * r_src + (255-a_src) * r_dst) >> 8;
    uint8_t G = (a_src * g_src + (255-a_src) * g_dst) >> 8;
    uint8_t B = (a_src * b_src + (255-a_src) * b_dst) >> 8;
    

    Now instead of storing the red channel r_src in your bitmap, you can store ra_src = (a_src * r_src) >> 8, and your rendering code can avoid one multiplication:

    uint8_t R = ra_src + ((255-a_src) * r_dst) >> 8;
    

    Variant: Run-Length Encoding

    The other trick to speed up CPU-rendering is called run-length encoding (RLE), where you don't store pixels with an alpha of zero (which may be more than half of your sprite) but instead store the number of transparent pixels that follow, so the rendering code can skip them.

    This gives a speed-up on older/simpler CPUs. Modern CPUs will probably not gain as much from this, because it messes with the branch predictor and may prevent automatic use of SIMD registers (depending on how you implement it). It may still be worth it - reducing the amount of memory access usually is more important than avoiding a multiplication.