Halide: How to avoid unwanted execution overhead in Halide LUT index


The mapping from input value to LUT index is the same for every call, so I calculate the contents of 'indexToLut' up front. However, this also means that the values in that buffer cannot be checked inside the generator. The LUT itself has only 17 elements.

#include "Halide.h"

#define LUT_SIZE 17     /* Size in each dimension of the 4D LUT */

class ApplyLut : public Halide::Generator<ApplyLut> {
public:
    // We declare the Inputs to the Halide pipeline as public
    // member variables. They'll appear in the signature of our generated
    // function in the same order as we declare them.
  Input <  Buffer<uint8_t>> Lut              { "Lut"            , 1};  // LUT to apply
  Input <  Buffer<int>> indexToLut           { "indexToLut"     , 1};  // Precalculated mapping of uint8_t to LUT index
  Input <  Buffer<uint8_t >> inputImageLine  { "inputImageLine" , 1};  // Input line
  Output<  Buffer<uint8_t >> outputImageLine { "outputImageLine", 1};  // Output line
  void generate();
};

HALIDE_REGISTER_GENERATOR(ApplyLut, outputImageLine)

void ApplyLut::generate()
{
  Var x("x");

  outputImageLine(x) = Lut(indexToLut(inputImageLine(x)));

  inputImageLine .dim(0).set_min(0);         // Input image sample index
  outputImageLine.dim(0).set_bounds(0, inputImageLine.dim(0).extent()); // Output line matches input line
  Lut            .dim(0).set_bounds(0, LUT_SIZE);   // iccLut[...]: limited number of values
  indexToLut     .dim(0).set_bounds(0, 256);        // chan4_offset[...]: indexed by value, 256 values
}

The question "Are there any restrictions with LUT: unbounded way in dimension" already states that such an issue can be solved by using 'clamp'.

This will change the expression to

  outputImageLine(x) = Lut(clamp(indexToLut(inputImageLine(x)), 0, LUT_SIZE - 1));  // clamp bounds are inclusive

However, the generated code shows the following expression

outputImageLine[outputImageLine.s0.x] = Lut[max(min(indexToLut[int32(inputImageLine[outputImageLine.s0.x])], 16), 0)]

I think this means that a min/max pair is evaluated for every sample at run time. In my case that can be omitted, because I know that all values of indexToLut are limited to 0..16. Is there a way to avoid the execution overhead in such a case?
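
For reference, the generated expression corresponds roughly to the scalar loop below. This is a minimal plain C++ sketch, not Halide output: the function name applyLutClamped and the constant kLutSize are made up for illustration, and the upper bound is taken as kLutSize - 1 so that the index stays inside the 17-element table.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int32_t kLutSize = 17;  // mirrors LUT_SIZE from the question

// Scalar equivalent of Lut[max(min(indexToLut[in[x]], kLutSize - 1), 0)].
// Assumes lut has kLutSize entries and indexToLut has 256 entries.
// (Hypothetical helper for illustration; not part of Halide.)
std::vector<uint8_t> applyLutClamped(const std::vector<uint8_t>& lut,
                                     const std::vector<int32_t>& indexToLut,
                                     const std::vector<uint8_t>& in) {
    std::vector<uint8_t> out(in.size());
    for (std::size_t x = 0; x < in.size(); ++x) {
        // First dependent load: input sample -> precalculated LUT index.
        const int32_t raw = indexToLut[in[x]];
        // The min/max pair the question asks about.
        const int32_t idx =
            std::max<int32_t>(std::min<int32_t>(raw, kLutSize - 1), 0);
        // Second dependent load: LUT index -> output value.
        out[x] = lut[static_cast<std::size_t>(idx)];
    }
    return out;
}
```

Per sample, the loop performs two dependent table loads plus one min and one max. When every entry of indexToLut already lies in 0..16, the min/max pair leaves the index unchanged.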


Solution

  • You can use unsafe_promise_clamped instead of clamp to promise that the index is bounded in the way you describe. It might not be any faster, though: min and max on integer indices are very cheap compared to the indirect load.
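
Because the promise is not checked by the generated code, it may be worth validating the precalculated table once on the host before the pipeline is called. The following is a minimal plain C++ sketch under stated assumptions: buildIndexTable, indexTableIsValid and lookupUnchecked are hypothetical helpers (not Halide API), and the even 0..255 to 0..16 mapping is only a stand-in for however 'indexToLut' is really computed. The Halide call appears only as a comment, since compiling it needs the Halide headers.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int32_t kLutSize = 17;  // mirrors LUT_SIZE from the question

using IndexTable = std::array<int32_t, 256>;
using Lut = std::array<uint8_t, kLutSize>;

// Hypothetical stand-in for the real table calculation:
// maps 0..255 evenly onto 0..kLutSize-1 with rounding.
IndexTable buildIndexTable() {
    IndexTable t{};
    for (int i = 0; i < 256; ++i) {
        t[static_cast<std::size_t>(i)] = (i * (kLutSize - 1) + 127) / 255;
    }
    return t;
}

// One-time host-side check that backs up the promise:
// every entry must be a valid LUT index in [0, kLutSize - 1].
bool indexTableIsValid(const IndexTable& t) {
    return std::all_of(t.begin(), t.end(),
                       [](int32_t v) { return v >= 0 && v < kLutSize; });
}

// Scalar picture of the lookup without min/max. In the generator the
// corresponding line would read (same inclusive bounds as clamp):
//   outputImageLine(x) = Lut(unsafe_promise_clamped(
//       indexToLut(inputImageLine(x)), 0, LUT_SIZE - 1));
uint8_t lookupUnchecked(const Lut& lut, const IndexTable& t, uint8_t in) {
    const int32_t idx = t[in];
    assert(idx >= 0 && idx < kLutSize);  // holds once the table is validated
    return lut[static_cast<std::size_t>(idx)];
}
```

Calling indexTableIsValid once after the table is built keeps the per-sample path free of checks while still catching an out-of-range entry before it reaches the pipeline. Whether dropping the min/max is measurable at all is best confirmed by benchmarking both variants.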