c colors bit-manipulation integer-arithmetic

Efficient way to split up RGB values in C

I'm writing some software for a 32-bit Cortex-M0 microcontroller in C and I'm doing a lot of manipulations with 32-bit RGB values. They are handled in a 32-bit integer format like 0x00BBRRGG. I want to be able to do math with them without worrying about carry bits spilling between the colors, so I need to split them up into three uint8_t values. Is there an efficient way of doing this? I'm assuming the inefficient way would be as follows:

blue = (RGB >> 16) & 0xFF;
green = (RGB >> 8) & 0xFF;
red = RGB & 0xFF;

//do math

new_RGB = (blue << 16) | (green << 8) | red;

Also, I have a couple of interfaces and one of them uses the format 0x00RRGGBB and the other uses 0x00BBRRGG. Is there an efficient way to convert between the two?

Solution

"I want to be able to do math with them without worrying about carry bits spilling between the colors, so I need to split them up into three uint8 values."

No, you usually do not need to split them into three uint8_t values. Consider this function:

uint32_t blend(const uint32_t argb0, const uint32_t argb1, const int phase)
{
    if (phase <= 0)
        return argb0;
    else if (phase < 256) {
        /* Split each color into two vectors of 16-bit lanes: 0x00RR00BB and 0x00AA00GG. */
        const uint32_t rb0 = argb0 & 0x00FF00FF;
        const uint32_t rb1 = argb1 & 0x00FF00FF;
        const uint32_t ag0 = (argb0 >> 8) & 0x00FF00FF;
        const uint32_t ag1 = (argb1 >> 8) & 0x00FF00FF;
        /* Each lane is at most 255 * 256, so it cannot carry into its neighbor. */
        const uint32_t rb = rb1 * phase + (256 - phase) * rb0;
        const uint32_t ag = ag1 * phase + (256 - phase) * ag0;
        /* Keep the high byte of each lane (a division by 256) and recombine. */
        return ((rb & 0xFF00FF00u) >> 8)
             |  (ag & 0xFF00FF00u);
    } else
        return argb1;
}

This function implements a linear blend from color argb0 (phase <= 0) to argb1 (phase >= 256), by splitting each input vector (with four 8-bit components) into two vectors with two 16-bit components.
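The same lane-splitting idea carries over to other per-channel math. As a minimal sketch (the helper name scale_argb and the 0..256 brightness scale are assumptions for illustration, not part of the original answer), dimming a color needs only two multiplications and no unpacking into bytes:

```c
#include <stdint.h>

/* Hypothetical helper: scale every 8-bit channel of a packed color
   by k/256. Assumes 0 <= k <= 256, so each 16-bit lane holds at most
   255 * 256 and cannot carry into the neighboring lane. */
uint32_t scale_argb(const uint32_t argb, const uint32_t k)
{
    const uint32_t rb = (argb & 0x00FF00FF) * k;         /* lanes 0x00RR00BB */
    const uint32_t ag = ((argb >> 8) & 0x00FF00FF) * k;  /* lanes 0x00AA00GG */

    /* The high byte of each lane is the scaled channel value. */
    return ((rb >> 8) & 0x00FF00FF) | (ag & 0xFF00FF00u);
}
```

For example, scale_argb(0x00FF8040, 128) halves each channel (rounding down) and yields 0x007F4020.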

If you don't need the alpha channel, then it may be more efficient to work on pairs of color values (say, for each pair of pixels), so that (0xRRGGBB, 0xrrggbb) is split into (0x00RR00BB, 0x00rr00bb, 0x00GG00gg). Each pair then yields three vectors to multiply instead of four, which in the above blend function means one fewer multiplication per pair (but one more AND and one more OR operation).
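One possible reading of the pair-wise variant is sketched below. The function name blend_pair_rgb, the array-based interface, and the requirement that phase is already within 0..256 are assumptions made for this example only:

```c
#include <stdint.h>

/* Hypothetical sketch: blend two 0x00RRGGBB pixels at once.
   a[] holds the pixel pair at phase 0, b[] the pair at phase 256.
   Assumes 0 <= phase <= 256 and a zero top byte in every input. */
void blend_pair_rgb(const uint32_t a[2], const uint32_t b[2],
                    const uint32_t phase, uint32_t out[2])
{
    const uint32_t inv = 256 - phase;

    /* Red/blue lanes of each pixel: 0x00RR00BB and 0x00rr00bb. */
    const uint32_t rb0a = a[0] & 0x00FF00FF, rb1a = a[1] & 0x00FF00FF;
    const uint32_t rb0b = b[0] & 0x00FF00FF, rb1b = b[1] & 0x00FF00FF;

    /* Both green channels share one vector: 0x00GG00gg. */
    const uint32_t gga = ((a[0] & 0xFF00) << 8) | ((a[1] >> 8) & 0xFF);
    const uint32_t ggb = ((b[0] & 0xFF00) << 8) | ((b[1] >> 8) & 0xFF);

    /* Six multiplications for two pixels, instead of eight. */
    const uint32_t rb0 = rb0b * phase + rb0a * inv;
    const uint32_t rb1 = rb1b * phase + rb1a * inv;
    const uint32_t gg  = ggb  * phase + gga  * inv;

    /* The high byte of each 16-bit lane is the blended channel. */
    out[0] = ((rb0 >> 8) & 0x00FF00FF) | ((gg >> 16) & 0x0000FF00);
    out[1] = ((rb1 >> 8) & 0x00FF00FF) | (gg & 0x0000FF00);
}
```

Whether this beats two calls to blend depends on the multiplier of the core in use.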

The cost of a 32-bit multiplication on Cortex-M0 devices varies between implementations: some have a single-cycle multiplier, while on others a multiplication takes 32 cycles. So, depending on the exact Cortex-M0 core used, replacing one multiplication with an AND and an OR may be a big speedup, or a slight slowdown.

When you actually do need the separate components, leaving the splitting to the compiler often leads to better generated code: instead of passing the color by value, pass a pointer to the color value,

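uint32_t  some_op(const uint32_t *const argb)
{
    /* Byte order assumes a little-endian target (the usual Cortex-M0
       configuration): byte 0 is the least significant byte of 0xAARRGGBB. */
    const uint32_t  b = ((const uint8_t *)argb)[0];
    const uint32_t  g = ((const uint8_t *)argb)[1];
    const uint32_t  r = ((const uint8_t *)argb)[2];
    const uint32_t  a = ((const uint8_t *)argb)[3];

    /* Do something ... */

    return (a << 24) | (r << 16) | (g << 8) | b;  /* placeholder: recombine */
}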

This is because many architectures have instructions that load an 8-bit value into a full register, setting all higher bits to zero. On Cortex-M0, ldrb does this for a byte in memory, and uxtb zero-extends a byte that is already in a register; the C compiler will do this for you. Marking the pointer, the pointed-to value, and the intermediate values const should allow the compiler to optimize the access so that it happens at the best position in the generated code, rather than having to keep the value in a register. This is especially true on architectures with few available registers, like 32-bit and 64-bit Intel and AMD architectures (x86 and x86-64). Cortex-M0 has 13 general-purpose 32-bit registers (r0 to r12), although most Thumb instructions can only use r0 to r7, and it depends on the ABI used which ones are "free" to use in a function.
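As an illustration of per-channel math through byte accesses, here is a minimal sketch of a saturating add. The function name add_saturate is hypothetical. Because every channel is treated the same way and the result is written back through the same kind of byte pointer, this particular example does not depend on the byte order of the target:

```c
#include <stdint.h>

/* Hypothetical helper: add two packed colors channel by channel,
   clamping each 8-bit channel at 255 instead of letting it carry
   into the next channel. */
uint32_t add_saturate(const uint32_t *const x, const uint32_t *const y)
{
    const uint8_t *const px = (const uint8_t *)x;
    const uint8_t *const py = (const uint8_t *)y;
    uint32_t result = 0;
    uint8_t *const pr = (uint8_t *)&result;
    int i;

    for (i = 0; i < 4; i++) {
        /* Each byte is zero-extended into a full register before the add. */
        const uint32_t sum = (uint32_t)px[i] + (uint32_t)py[i];
        pr[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
    return result;
}
```

With x = 0x00F08010 and y = 0x00204020, the most significant color channel saturates and the result is 0x00FFC030.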

Note that if you are using GCC to compile your code, you can use

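uint32_t oabc_to_ocba(uint32_t c)
{
    /* rev reverses the byte order: 0x0ABC becomes 0xCBA0.
       The input is operand %1; it need not share a register with %0. */
    asm ( "rev %0, %1\n\t"
        : "=r" (c)
        : "r" (c)
        );
    return c >> 8;
}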

to convert 0x0ABC to 0x0CBA and vice versa. Normally, it compiles to the three instructions "rev r0, r0", "lsrs r0, r0, #8", and "bx lr", but the compiler can inline it and use another register instead of r0.
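For other compilers, or to leave instruction selection to the optimizer, the same conversions can be written in portable C. The sketch below uses hypothetical function names; swap_outer_bytes matches the 0x0ABC to 0x0CBA conversion above, while the other two handle the 0x00RRGGBB and 0x00BBRRGG layouts exactly as they are written in the question, which amounts to rotating the low 24 bits by one byte:

```c
#include <stdint.h>

/* 0x00AABBCC <-> 0x00CCBBAA: swap the two outer color bytes.
   Applying it twice returns the original value. */
uint32_t swap_outer_bytes(const uint32_t c)
{
    return ((c & 0x0000FF) << 16) | (c & 0x00FF00) | ((c >> 16) & 0xFF);
}

/* 0x00RRGGBB -> 0x00BBRRGG: rotate the low 24 bits right by 8. */
uint32_t rrggbb_to_bbrrgg(const uint32_t c)
{
    return ((c & 0xFF) << 16) | ((c >> 8) & 0xFFFF);
}

/* 0x00BBRRGG -> 0x00RRGGBB: rotate the low 24 bits left by 8. */
uint32_t bbrrgg_to_rrggbb(const uint32_t c)
{
    return ((c & 0xFFFF) << 8) | ((c >> 16) & 0xFF);
}
```

On GCC, __builtin_bswap32(c) >> 8 is another way to express the byte reversal without inline assembly.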