I'm trying to implement a color conversion Func that outputs to 3 separate buffers. The rgb_to_ycocg function takes an interleaved 4-channel, 8-bit input buffer (BGRA) and writes to 3 output buffers (Y, Co and Cg), each holding 16-bit values. Currently, I'm using this piece of code:
void rgb_to_ycocg(const uint8_t *pSrc, int32_t srcStep, int16_t *pDst[3], int32_t dstStep[3], int width, int height)
{
    Buffer<uint8_t> inRgb((uint8_t *)pSrc, 4, width, height);
    Buffer<int16_t> outY(pDst[0], width, height);
    Buffer<int16_t> outCo(pDst[1], width, height);
    Buffer<int16_t> outCg(pDst[2], width, height);
    Var x, y, c;
    Func calcY, calcCo, calcCg, inRgb16;
    inRgb16(c, x, y) = cast<int16_t>(inRgb(c, x, y));
    calcY(x, y) = (inRgb16(0, x, y) + ((inRgb16(2, x, y) - inRgb16(0, x, y)) >> 1)) + ((inRgb16(1, x, y) - (inRgb16(0, x, y) + ((inRgb16(2, x, y) - inRgb16(0, x, y)) >> 1))) >> 1);
    calcCo(x, y) = inRgb16(2, x, y) - inRgb16(0, x, y);
    calcCg(x, y) = inRgb16(1, x, y) - (inRgb16(0, x, y) + ((inRgb16(2, x, y) - inRgb16(0, x, y)) >> 1));
    Pipeline p = Pipeline({calcY, calcCo, calcCg});
    p.vectorize(x, 16).parallel(y);
    p.realize({ outY, outCo, outCg });
}
The issue is that I'm getting poor performance compared to the reference implementation (basic for loops in C). I understand I need to try better scheduling, but I think I'm doing something wrong in terms of input/output buffers. I've seen the tutorials and tried to come up with a way to output to multiple buffers. Using a Pipeline was the only way I could find. Would I be better off making 3 Funcs and calling them separately? Is this a correct use of the Pipeline class?
The biggest likely problem here is that you're building and compiling the pipeline every time you want to convert a single image. That would be really slow. Use ImageParams instead of Buffers, define the Pipeline once, and then realize it multiple times.
A second-order effect is that I think you actually want a Tuple rather than a Pipeline. A Tuple Func computes all its values in the same inner loop, which will reuse the loads from inRgb, etc. Ignoring the recompilation problem for the moment, try:
void rgb_to_ycocg(const uint8_t *pSrc, int32_t srcStep, int16_t *pDst[3], int32_t dstStep[3], int width, int height)
{
    Buffer<uint8_t> inRgb((uint8_t *)pSrc, 4, width, height);
    Buffer<int16_t> outY(pDst[0], width, height);
    Buffer<int16_t> outCo(pDst[1], width, height);
    Buffer<int16_t> outCg(pDst[2], width, height);
    Var x, y, c;
    Func out, inRgb16;
    inRgb16(c, x, y) = cast<int16_t>(inRgb(c, x, y));
    // One Tuple-valued Func: Y, Co and Cg are computed in the same inner loop.
    out(x, y) = {
        (inRgb16(0, x, y) + ((inRgb16(2, x, y) - inRgb16(0, x, y)) >> 1)) + ((inRgb16(1, x, y) - (inRgb16(0, x, y) + ((inRgb16(2, x, y) - inRgb16(0, x, y)) >> 1))) >> 1),
        inRgb16(2, x, y) - inRgb16(0, x, y),
        inRgb16(1, x, y) - (inRgb16(0, x, y) + ((inRgb16(2, x, y) - inRgb16(0, x, y)) >> 1))
    };
    out.vectorize(x, 16).parallel(y);
    out.realize({ outY, outCo, outCg });
}
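To check the Halide output, it can help to have the same arithmetic as a plain single-pass loop, which is roughly the shape of inner loop a Tuple Func aims for: each pixel's channels are loaded once and all three outputs are written from them. The following is only a minimal sketch. The name rgb_to_ycocg_ref is hypothetical, and it assumes densely packed buffers (it ignores srcStep and dstStep, just as the Halide code above does).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reference helper, not part of the original post.
// Assumes src holds width*height tightly packed BGRA pixels and each
// output plane holds width*height int16_t values with no row padding.
void rgb_to_ycocg_ref(const uint8_t *src, int16_t *outY, int16_t *outCo,
                      int16_t *outCg, int width, int height)
{
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const std::size_t i = static_cast<std::size_t>(y) * width + x;
            // Load each channel once; channel order is B, G, R, A.
            const int b = src[4 * i + 0];
            const int g = src[4 * i + 1];
            const int r = src[4 * i + 2];
            // Same expressions as the Halide code, with shared terms named.
            // Relies on >> being an arithmetic shift for negative values,
            // which is guaranteed from C++20 and typical before that.
            const int co = r - b;
            const int t = b + (co >> 1);
            const int cg = g - t;
            outY[i] = static_cast<int16_t>(t + (cg >> 1));
            outCo[i] = static_cast<int16_t>(co);
            outCg[i] = static_cast<int16_t>(cg);
        }
    }
}
```

Running this and the Halide version on the same input and comparing the three planes element by element is a quick way to confirm that a schedule change did not alter the results.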