Running Halide generators from cmake with optimal compiler flags and configurations

OK, so: I have successfully integrated the first working Halide generator into the cmake build system for my little image-processing project.

The generator implements an image-resizing and -resampling algorithm based on the example code in the Halide codebase (Halide/apps/resize/resize.cpp). I adapted the sample to leverage generator parameters, and tied the generators’ compilation and invocation to my cmake script using the functions defined in HalideGenerator.cmake, just as the Halide project does in its own build script.

All this works great, so far – but my domain expertise is lacking in the realm of code-generation nuances. For example, I tweaked the scheduling method to get the best observed empirical speed on my laptop – but despite taking many long tinkering sessions and code-reading sojourns into the depths of Halide’s many generator-related tools and scripts, I have only the most superficial understanding of the code-generation process.

Specifically, I don’t know how to approach choosing target options. Is it best to use defaults or try to turn on specific options for my target platform – and if the latter, do I have to have conditional code somewhere, or can the binary include fallbacks?

Here’s what I am talking about: in the source for Halide tutorial lesson #15, there’s a complex script that invokes a generator with various options. Here’s a snippet from code comments in this script:

# If you're compiling and linking multiple Halide pipelines, then the
# multiple copies of the runtime should combine into a single copy
# (via weak linkage). If you're compiling and linking for multiple
# different targets (e.g. avx and non-avx), then the runtimes might be
# different, and you can't control which copy of the runtime the
# linker selects.

# You can control this behavior explicitly by compiling your pipelines
# with the no_runtime target flag. Let's generate and link several
# different versions of the first pipeline for different x86 variants: [snip]

… from this it is hard to separate what must be done from what should be done, or from what may be done at one’s discretion. By comparison, one doesn’t have to deal with these issues when setting up C++ or Objective-C projects (even more Byzantine examples), as the compiler and linker make most of these decisions for you and at most need a flag or two.

My question is: how can I integrate the Halide generator’s output library binaries into my existing project – such that the generator output is as fast as possible (e.g. uses GPU, SSE2/3, AVX2 etc) without further constraining portability (e.g. it won’t mysteriously segfault on a slightly different machine)?

Specifically, what should my process be – as in, should I only target the lowest-common-denominator at first, and then leverage more exotic processor features incrementally?

EDIT: As I mentioned in comments below, this is what my GenGen binary outputs to stdout when invoked with no options:

[Screenshot of the GenGen output, hosted on Imgur, not reproduced here]

Solution

For the case of pre-generating your binaries (AOT), it sounds like you want runtime dispatch. Your program will examine the CPU/GPU environment at startup and decide which features (AVX, OpenCL, etc.) should be used. This is not Halide-specific.

1. Select a set of advanced features to target (e.g. a high-powered desktop GPU) as a best case, and a set of minimal features that will work on every machine (e.g. SSE2 only).
2. Build a DLL/dylib/so for each of these feature sets that contains every performance-hungry function. These can be scheduled differently or even built with completely different Func definitions. You can have both sets in the same source code file and test the Target object at generation time to choose between them.
3. At program startup, see if your best-case features are present and, if so, load that library and use it. If any features are missing, default to the most compatible version.
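
The selection step at program startup could be sketched roughly as follows. This is only an illustration under assumptions: the library file names, the `Variant` and `CpuFeatures` structs, and the helper names are all hypothetical, and the CPU query relies on the GCC/Clang builtin `__builtin_cpu_supports`, which is available on x86 only. GPU detection and the actual loading of the chosen library (e.g. via `dlopen` or `LoadLibrary`) are left out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one prebuilt library and the CPU features
// it was generated for. The names used with it are illustrative only.
struct Variant {
    std::string library;  // e.g. "libresize_avx2.so" (assumed file name)
    bool needs_sse41;
    bool needs_avx2;
};

struct CpuFeatures {
    bool sse41;
    bool avx2;
};

// Query the host CPU. On compilers or architectures without the builtin,
// this sketch simply reports that no optional features are present.
inline CpuFeatures detect_cpu() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return CpuFeatures{__builtin_cpu_supports("sse4.1") != 0,
                       __builtin_cpu_supports("avx2") != 0};
#else
    return CpuFeatures{false, false};
#endif
}

// Variants are ordered best-first; the last entry is assumed to be the
// most compatible build. Returns the first variant the CPU can run,
// falling back to the last entry if none matches.
inline std::string pick_library(const std::vector<Variant>& variants,
                                const CpuFeatures& cpu) {
    assert(!variants.empty());
    for (const Variant& v : variants) {
        const bool sse41_ok = !v.needs_sse41 || cpu.sse41;
        const bool avx2_ok = !v.needs_avx2 || cpu.avx2;
        if (sse41_ok && avx2_ok) return v.library;
    }
    return variants.back().library;
}
```

A caller would pass `detect_cpu()` to `pick_library` once at startup and hand the returned name to the platform's dynamic loader. Keeping the decision in a pure function makes it easy to test each fallback path without owning every kind of machine.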

You are free to choose how many feature sets and libraries you want to support.

The alternative is to compile your Halide code at program startup (JIT). I recommend using the Target object returned by get_jit_target_from_environment(), which uses the contents of the environment variable HL_JIT_TARGET, or "host" if that variable is not set. The "host" target string is the same as get_host_target() and means Halide will examine the machine it is running on and set whatever CPU features it finds. As far as I know, GPU features such as cuda or opencl are not implied by "host" and have to be requested explicitly, for example with HL_JIT_TARGET=host-cuda. You can then dynamically test the Target object and use GPU or CPU scheduling.
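
To make the fallback rule concrete, here is a small sketch in plain C++ that mimics the described behaviour without linking against Halide. It is not Halide's implementation: the function names are hypothetical, treating an empty variable like an unset one is an assumption, and the list of GPU feature names checked is deliberately incomplete.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>

// Mirrors the rule described above: use the value of HL_JIT_TARGET if it
// is set, otherwise fall back to "host". Treating an empty value like an
// unset one is an assumption of this sketch.
inline std::string resolve_jit_target(const char* env_value) {
    if (env_value == nullptr || *env_value == '\0') return "host";
    return env_value;
}

// Convenience wrapper that reads the real environment variable.
inline std::string jit_target_from_environment() {
    return resolve_jit_target(std::getenv("HL_JIT_TARGET"));
}

// Halide target strings are hyphen-separated, e.g. "host-cuda".
// Returns true if any token names one of a few GPU features; the list
// here is illustrative, not exhaustive.
inline bool requests_gpu(const std::string& target) {
    std::istringstream tokens(target);
    std::string token;
    while (std::getline(tokens, token, '-')) {
        if (token == "cuda" || token == "opencl" || token == "metal") {
            return true;
        }
    }
    return false;
}
```

In real Halide code you would not parse the string yourself: the Target returned by get_jit_target_from_environment() can be queried directly, for instance with has_gpu_feature(), and the result used to pick between a GPU and a CPU schedule.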