The mapping from input value to LUT index is the same for every call, so I calculate the contents of 'indexToLut' up front. However, this also means that the values in that buffer cannot be checked here, inside the generator. The LUT itself has only 17 elements.
#define LUT_SIZE 17 /* Size in each dimension of the 4D LUT */
class ApplyLut : public Halide::Generator<ApplyLut> {
public:
    // We declare the Inputs to the Halide pipeline as public
    // member variables. They'll appear in the signature of our generated
    // function in the same order as we declare them.
    Input<Buffer<uint8_t>>  Lut{"Lut", 1};                          // LUT to apply
    Input<Buffer<int>>      indexToLut{"indexToLut", 1};            // Precalculated mapping of uint8_t to LUT index
    Input<Buffer<uint8_t>>  inputImageLine{"inputImageLine", 1};    // Input line
    Output<Buffer<uint8_t>> outputImageLine{"outputImageLine", 1};  // Output line
    void generate();
};
HALIDE_REGISTER_GENERATOR(ApplyLut, outputImageLine)
void ApplyLut::generate()
{
    Var x("x");
    outputImageLine(x) = Lut(indexToLut(inputImageLine(x)));
    inputImageLine.dim(0).set_min(0);                                       // Input image sample index
    outputImageLine.dim(0).set_bounds(0, inputImageLine.dim(0).extent());   // Output line matches input line
    Lut.dim(0).set_bounds(0, LUT_SIZE);                                     // iccLut[...]: limited number of values
    indexToLut.dim(0).set_bounds(0, 256);                                   // chan4_offset[...]: value index: 256 values
}
In the question "Are there any restrictions with LUT: unbounded way in dimension", it was already stated that an unbounded LUT access like this can be solved by using 'clamp'.
This changes the expression to:
outputImageLine(x) = Lut(clamp(indexToLut(inputImageLine(x)), 0, LUT_SIZE));
However, the generated code shows the following expression:
outputImageLine[outputImageLine.s0.x] = Lut[max(min(indexToLut[int32(inputImageLine[outputImageLine.s0.x])], 17), 0)]
I think this means that a min/max is evaluated at run time for every sample. In my case that is unnecessary, because I know that all values of indexToLut are limited to 0..16. Is there a way to avoid this execution overhead?
You can use unsafe_promise_clamped instead of clamp to promise that the input is bounded in the way you describe. It might not be any faster though, because min and max on integer indices are very cheap compared to the indirect load.
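Since the promise is not checked by the generated code, it seems wise to validate indexToLut once on the host, right after it is precalculated. Below is a minimal sketch of such a check in plain C++. The names indexTableIsValid and kLutSize are hypothetical and not part of Halide or of the question's code. Note also that the bounds of clamp and unsafe_promise_clamped are inclusive, so for a 17-entry LUT the upper bound would be LUT_SIZE - 1 rather than LUT_SIZE; the comment at the end shows how the lookup would then read in the generator.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirrors LUT_SIZE from the generator (hypothetical name, for illustration).
constexpr int kLutSize = 17;

// Returns true only if the table has exactly 256 entries (one per uint8_t
// input value) and every entry is a valid LUT index in [0, lutSize - 1].
// Intended to run once on the host, before the table is passed to the
// pipeline, because the pipeline itself will not check the promise.
bool indexTableIsValid(const std::vector<int> &indexToLut, int lutSize) {
    if (indexToLut.size() != 256) {
        return false;
    }
    for (int idx : indexToLut) {
        if (idx < 0 || idx >= lutSize) {
            return false;
        }
    }
    return true;
}

// With the table validated, the generator line would become (assuming the
// inclusive bounds described above):
//
//   outputImageLine(x) = Lut(unsafe_promise_clamped(
//       indexToLut(inputImageLine(x)), 0, LUT_SIZE - 1));
```

If the check fails, falling back to the clamp version (or rejecting the table) keeps the lookup within bounds; an out-of-range value under the promise would read outside the LUT.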