I'm writing some software for a 32-bit Cortex-M0 microcontroller in C and I'm doing a lot of manipulation of 32-bit RGB values. They are handled in a 32-bit integer format like 0x00BBRRGG. I want to be able to do math with them without worrying about carry bits spilling between the colors, so I need to split them up into three uint8 values. Is there an efficient way of doing this? I'm assuming the inefficient way would be as follows:
/* Layout 0x00BBRRGG: blue in bits 16-23, red in bits 8-15, green in bits 0-7 */
blue = (RGB >> 16) & 0xFF;
red = (RGB >> 8) & 0xFF;
green = RGB & 0xFF;
//do math
new_RGB = (blue << 16) | (red << 8) | green;
Also, I have a couple of interfaces: one of them uses the format 0x00RRGGBB and the other uses 0x00BBRRGG. Is there an efficient way to convert between the two?
"I want to be able to do math with them without worrying about carry bits spilling between the colors, so I need to split them up into three uint8 values."
No, usually you do not need to split them into three uint8 values. Consider this function:
uint32_t blend(const uint32_t argb0, const uint32_t argb1, const int phase)
{
    if (phase <= 0)
        return argb0;
    else
    if (phase < 256) {
        /* Red and blue, each in its own 16-bit lane */
        const uint32_t rb0 = argb0 & 0x00FF00FF;
        const uint32_t rb1 = argb1 & 0x00FF00FF;
        /* Alpha and green, shifted down into the same lane positions */
        const uint32_t ag0 = (argb0 >> 8) & 0x00FF00FF;
        const uint32_t ag1 = (argb1 >> 8) & 0x00FF00FF;
        /* Weighted sums; each lane now holds an 8.8 fixed-point result */
        const uint32_t rb = rb1 * phase + (256 - phase) * rb0;
        const uint32_t ag = ag1 * phase + (256 - phase) * ag0;
        /* Keep the upper 8 bits of each lane and recombine */
        return ((rb & 0xFF00FF00u) >> 8)
             |  (ag & 0xFF00FF00u);
    } else
        return argb1;
}
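This function implements a linear blend from color argb0 (phase <= 0) to argb1 (phase >= 256), by splitting each input vector (with four 8-bit components) into two vectors with two 16-bit components each. Every component then has 16 bits of room, so the weighted sum (at most 255 * 256) cannot carry into its neighbor.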
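The same lane-splitting trick applies to other per-channel math. As a minimal sketch (the function name scale_channels and its level parameter are made up for illustration, and level is assumed to stay within 0..256), this scales all four channels with two multiplications and no per-byte split:

```c
#include <stdint.h>

/* Hypothetical helper: multiply every 8-bit channel by level/256.
   Assumes 0 <= level <= 256; larger values would overflow the 16-bit lanes. */
uint32_t scale_channels(const uint32_t argb, const uint32_t level)
{
    /* Red and blue in separate 16-bit lanes, scaled together */
    const uint32_t rb = (argb & 0x00FF00FFu) * level;
    /* Alpha and green, shifted down into the same lane positions */
    const uint32_t ag = ((argb >> 8) & 0x00FF00FFu) * level;
    /* The upper byte of each lane is the scaled channel */
    return ((rb >> 8) & 0x00FF00FFu) | (ag & 0xFF00FF00u);
}
```

With level = 128 each channel is halved (rounding down), and level = 256 returns the input unchanged.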
If you don't need the alpha channel, then it may be more efficient to work on pairs of color values (say, for each pair of pixels), so that (0xRRGGBB, 0xrrggbb) is split into (0x00RR00BB, 0x00rr00bb, 0x00GG00gg). In the above blend function this means one fewer multiplication (but one more AND and one more OR operation).
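A rough sketch of that pairwise split, under some assumptions of mine: the name blend_pair, the array-based signature, and the restriction of phase to 0..256 are illustrative choices, not part of the blend function above. Each pair of pixels is handled as three two-lane vectors instead of four:

```c
#include <stdint.h>

/* Hypothetical helper: blend two 0x00RRGGBB pixels at once, ignoring alpha.
   Assumes 0 <= phase <= 256 (0 gives from[], 256 gives to[]). */
void blend_pair(const uint32_t from[2], const uint32_t to[2],
                const uint32_t phase, uint32_t out[2])
{
    const uint32_t w1 = phase;
    const uint32_t w0 = 256u - phase;

    /* 0x00RR00BB of the first pixel, 0x00rr00bb of the second */
    const uint32_t rb_a = (to[0] & 0x00FF00FFu) * w1 + (from[0] & 0x00FF00FFu) * w0;
    const uint32_t rb_b = (to[1] & 0x00FF00FFu) * w1 + (from[1] & 0x00FF00FFu) * w0;

    /* 0x00GG00gg: green of the first pixel in the high lane,
       green of the second pixel in the low lane */
    const uint32_t gg0 = ((from[0] << 8) & 0x00FF0000u) | ((from[1] >> 8) & 0x000000FFu);
    const uint32_t gg1 = ((to[0]   << 8) & 0x00FF0000u) | ((to[1]   >> 8) & 0x000000FFu);
    const uint32_t gg  = gg1 * w1 + gg0 * w0;

    /* The upper byte of each 16-bit lane is the blended channel */
    out[0] = ((rb_a >> 8) & 0x00FF00FFu) | ((gg >> 16) & 0x0000FF00u);
    out[1] = ((rb_b >> 8) & 0x00FF00FFu) | (gg & 0x0000FF00u);
}
```

Whether this wins over two plain blend calls depends on the multiplier, as discussed next, so it is worth measuring on the actual part.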
The cost of a 32-bit multiplication on Cortex-M0 devices varies between implementations. Some have a single-cycle multiplier; on others a multiplication takes 32 cycles. So, depending on the exact Cortex-M0 core used, replacing one multiplication with an AND and an OR may be a big speedup, or a slight slowdown.
When you actually do need the separate components, leaving the splitting to the compiler often produces better code: instead of passing the color by value, pass a pointer to the color value,
uint32_t some_op(const uint32_t *const argb)
{
    /* Byte order below assumes a little-endian target and a 0xAARRGGBB
       value: byte 0 is the least significant byte, i.e. blue. */
    const uint32_t b = ((const uint8_t *)argb)[0];
    const uint32_t g = ((const uint8_t *)argb)[1];
    const uint32_t r = ((const uint8_t *)argb)[2];
    const uint32_t a = ((const uint8_t *)argb)[3];
    /* Do something ... */
}
This is because many architectures have instructions that load an 8-bit value into a full register, setting all higher bits to zero (zero extension; on Cortex-M0, ldrb does this when loading from memory and uxtb does it for a value already in a register; the C compiler will pick these for you). Marking both the pointer and the pointed-to value, as well as the intermediate values, const should allow the compiler to optimize the access so that it happens at the best moment/position in the generated code, rather than having to keep the value in a register. (This is especially true on architectures with few (available) registers, like 32-bit and 64-bit Intel and AMD architectures (x86 and x86-64). Cortex-M0 has 13 general-purpose 32-bit registers, r0 to r12, although most of its Thumb instructions can only use r0 to r7, and it depends on the ABI used which ones are "free" to use in a function.)
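To make the pointer-based access concrete, here is a small sketch of a per-channel saturating add. The name add_saturate is hypothetical and the operation is only an example of "do something"; because each byte is processed independently, this particular function does not care about byte order:

```c
#include <stdint.h>

/* Hypothetical example: add two packed colors channel by channel,
   clamping each 8-bit channel at 255 instead of letting it carry. */
uint32_t add_saturate(const uint32_t *const c0, const uint32_t *const c1)
{
    const uint8_t *const p0 = (const uint8_t *)c0;
    const uint8_t *const p1 = (const uint8_t *)c1;
    uint32_t result;
    uint8_t *const out = (uint8_t *)&result;

    for (int i = 0; i < 4; i++) {
        /* Each byte is zero-extended on load, so the sum cannot wrap */
        const uint32_t sum = (uint32_t)p0[i] + (uint32_t)p1[i];
        out[i] = (uint8_t)((sum > 255u) ? 255u : sum);
    }
    return result;
}
```

For example, adding 0x00204060 to 0x00F01020 saturates the top color channel and gives 0x00FF5080.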
Note that if you are using GCC to compile your code, you can use
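uint32_t oabc_to_ocba(uint32_t c)
{
    /* "+l": c is both input and output, and must live in r0-r7,
       the only registers the 16-bit Thumb rev encoding accepts. */
    asm volatile ( "rev %0, %0\n\t"
                 : "+l" (c)
                 );
    return c >> 8;
}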
to convert 0x0ABC to 0x0CBA and vice versa (each letter stands for one byte, so 0x00AABBCC becomes 0x00CCBBAA). Normally, it compiles to rev r0, r0; lsrs r0, r0, #8; bx lr, but the compiler can inline it and use another register instead of r0.
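Note that the two formats in the question, 0x00RRGGBB and 0x00BBRRGG, differ by a rotation of the three color bytes rather than a reversal. A plain C sketch of both directions, plus a compiler-independent version of the reversal above (all three function names are made up for illustration):

```c
#include <stdint.h>

/* 0x00BBRRGG -> 0x00RRGGBB: rotate the low 24 bits left by one byte */
uint32_t bbrrgg_to_rrggbb(const uint32_t c)
{
    return ((c << 8) & 0x00FFFF00u) | ((c >> 16) & 0x000000FFu);
}

/* 0x00RRGGBB -> 0x00BBRRGG: rotate the low 24 bits right by one byte */
uint32_t rrggbb_to_bbrrgg(const uint32_t c)
{
    return ((c & 0x000000FFu) << 16) | ((c >> 8) & 0x0000FFFFu);
}

/* 0x00AABBCC -> 0x00CCBBAA without inline assembly */
uint32_t swap_outer_bytes(const uint32_t c)
{
    return ((c & 0x000000FFu) << 16) | (c & 0x0000FF00u) | ((c >> 16) & 0x000000FFu);
}
```

Each rotation is two shifts, two ANDs and an OR, so on a Cortex-M0 it should cost only a handful of cycles; checking the generated assembly is the way to confirm that for a given compiler.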