I am looking for a more ergonomic way to do loop splitting in C++.
Imagine we want to blur all the rows of an image by convolving each row with the well-known {1, 4, 6, 4, 1}
filter kernel. The first and last two pixels of each row have to be handled separately, because otherwise we would access out-of-bounds pixels.
Unfortunately, those boundary checks inhibit compiler optimization, so instead of writing a single loop with boundary checks (see blur_rows1 below for reference), we have to split the loop into three separate loops (see function blur_rows2):
// Blur row with loop splitting.
for (int x = 0; x < radius; x++) blur(dst, src, nx, x, y, kernel, radius);
for (int x = radius; x < nx - radius; x++) blur(dst, src, nx, x, y, kernel, radius);
for (int x = nx - radius; x < nx; x++) blur(dst, src, nx, x, y, kernel, radius);
This makes blur_rows2 roughly 10 times faster than blur_rows1 (50 µs vs 500 µs for a 512x512 image).
There are multiple ways to implement loop splitting, which all have various flaws:

- Write the three loops by hand, as above, which is verbose and error-prone.
- Move the loop body into a helper function or lambda and force inlining (e.g. __attribute__((always_inline)) with GCC).
- Put the loop body in a macro, which requires a \ after every line.
- Pad src with boundary fill values so the boundary check is not required.
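For illustration, the forced-inlining variant could be sketched like this (the split_loop helper and its name are hypothetical, and the attribute assumes GCC/Clang):

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: the loop body goes into a lambda and a helper
// generates the three split loops. always_inline ensures the lambda does
// not survive as an opaque function call, so the compiler can specialize
// the interior loop and drop the boundary check there.
template <class F>
__attribute__((always_inline)) inline
void split_loop(int lo, int mid1, int mid2, int hi, F &&body){
    for (int x = lo;   x < mid1; x++) body(x); // left boundary
    for (int x = mid1; x < mid2; x++) body(x); // interior, branch-free after inlining
    for (int x = mid2; x < hi;   x++) body(x); // right boundary
}
```

The downside is that every loop body has to be routed through a lambda, and the attribute is non-standard.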
It would be nice if there were compiler support for something like #pragma loop split (0, radius, nx - radius, nx) or, even better, some way to tell the compiler that it should split the loop such that the condition if (0 <= x2 && x2 < nx) goes away.
Is there something like this? I feel like it might be possible with some clever tricks like Duff's device or template magic. Solutions that only work with GCC or Clang are fine.
#include <cstdlib>
#include <cstdint>
#include <vector>
#include <chrono>
#include <iostream>

using std::uint8_t;

const uint32_t kernel[5] = {1, 4, 6, 4, 1};
const int radius = 2;

// Blur the pixel at (x, y) from src image of size (nx, ny) and write to dst image of same size
static inline void blur(std::vector<uint8_t> &dst, const std::vector<uint8_t> &src, int nx, int x, int y, const uint32_t *kernel, int radius){
    uint32_t sum = 0;
    for (int x2 = x - radius; x2 <= x + radius; x2++){
        // Only sum pixels within image boundaries.
        // The compiler is smart enough to remove the branch, but only when doing manual loop splitting.
        if (0 <= x2 && x2 < nx){
            sum += kernel[x2 - x + radius] * src[y * nx + x2];
        }
    }
    dst[y * nx + x] = sum / 16;
}

void blur_rows1(std::vector<uint8_t> &dst, const std::vector<uint8_t> &src, int nx, int ny){
    for (int y = 0; y < ny; y++){
        // Blur row without loop splitting.
        for (int x = 0; x < nx; x++){
            blur(dst, src, nx, x, y, kernel, radius);
        }
    }
}

void blur_rows2(std::vector<uint8_t> &dst, const std::vector<uint8_t> &src, int nx, int ny){
    for (int y = 0; y < ny; y++){
        // Blur row with loop splitting.
        for (int x = 0; x < radius; x++) blur(dst, src, nx, x, y, kernel, radius);
        for (int x = radius; x < nx - radius; x++) blur(dst, src, nx, x, y, kernel, radius);
        for (int x = nx - radius; x < nx; x++) blur(dst, src, nx, x, y, kernel, radius);
    }
}

double sec(){
    using namespace std::chrono;
    return duration_cast<duration<double>>(steady_clock::now().time_since_epoch()).count();
}

int main(){
    int nx = 512;
    int ny = 512;
    std::vector<uint8_t> src(nx * ny);
    std::vector<uint8_t> dst1(nx * ny);
    std::vector<uint8_t> dst2(nx * ny);
    // Initialize src image.
    for (int i = 0; i < nx * ny; i++){
        src[i] = std::rand() & 0xff;
    }
    // Run benchmark 10 times.
    for (int i_run = 0; i_run < 10; i_run++){
        double t0 = sec();
        blur_rows1(dst1, src, nx, ny);
        double t1 = sec();
        blur_rows2(dst2, src, nx, ny);
        double t2 = sec();
        std::cout << "blur_rows1: " << (t1 - t0) * 1e6 << " µs" << std::endl;
        std::cout << "blur_rows2: " << (t2 - t1) * 1e6 << " µs" << std::endl;
        std::cout << std::endl;
        // Check that the result is correct.
        for (int i = 0; i < nx * ny; i++){
            if (dst1[i] != dst2[i]) std::exit(1);
        }
    }
    return 0;
}
> It would be nice if there was compiler support for something like #pragma loop split (0, radius, nx - radius, nx)
AFAIK, there is no such thing in GCC/Clang (for C++). Note that loop splitting is implemented in both GCC and Clang, but it is brittle and does not seem to be able to optimize your code. Here is a basic example where we can see that loop splitting is applied by both GCC and Clang. Note that there is a pragma to control this kind of transformation, but I think it is not powerful enough to do what you want: see #pragma clang loop distribute(enable).
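As a minimal illustration, the pragma is attached directly to the loop it should affect (the function below is made up for the example, and the pragma is only a hint that the compiler may ignore; other compilers will just warn about an unknown pragma):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Clang-specific hint asking the compiler to distribute this loop into
// two independent loops (one per statement). It is a hint, not a command.
void saxpy_and_copy(std::vector<float> &a, std::vector<float> &b,
                    const std::vector<float> &c, float k){
#pragma clang loop distribute(enable)
    for (std::size_t i = 0; i < a.size(); i++){
        a[i] = k * c[i];     // would become the first loop
        b[i] = a[i] + c[i];  // would become the second loop
    }
}
```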
In HPC codes, a common solution is to operate on a bigger 2D array (as proposed by PepijnKramer in the comments) so as to avoid any expensive boundary condition. You can fill the borders with values like zeros, repeated values, average values, etc., depending on what you exactly want. Your current code makes the blur fade to black at the borders, which is a bit unusual (as opposed to using repeated values or the average). A copy can be done if the input cannot be directly created with such borders.
> Increases memory requirements.

Here, you can copy the input line by line into a preallocated temporary buffer reused for all lines (very low memory usage, and cache-friendly since the data can fit in the L1 cache).
> Increases memory bandwidth cost.

Not that of the DRAM (which is often the main bottleneck) nor of the L3 cache, unless lines are really huge. A line of any reasonable size should fit in the L1 or L2 cache of all mainstream CPUs. These caches are very fast (fast enough that the computation should still be compute-bound).
Actually, you can copy the line AND convert items from uint8_t to uint32_t (or even uint16_t, see later) so as to avoid doing 5 conversions per item during the blur. This makes the blur faster, while the copy/conversion speed should be mostly unaffected thanks to instruction-level parallelism.
> This makes blur_rows2 roughly 10 times faster than blur_rows1 (50 µs vs 500 µs for a 512x512 image).
OK, but 50 µs for this is actually not so fast. Indeed, this means 512*512/50e-6/3.5e9 ~= 1.5 items/cycle at 3.5 GHz.
The first function is inefficient mainly because the code is not vectorized. The code in main is vectorized by GCC but not by Clang, and neither GCC nor Clang vectorizes the two functions when they are not inlined (https://godbolt.org/z/YrzGWcGxE). This comes down to inlining, constant propagation (especially of radius) and pointer aliasing analysis (src vs dst). The compiler does not seem to optimize the code for this specific kernel and seems to perform SIMD packing/unpacking, so it is likely sub-optimal.
You can help Clang to generate SIMD code (despite the conditional) with OpenMP, more specifically with the omp declare simd (with some additional clauses) and omp simd directives. Compilers like ICC/ICX and MSVC also support this. AFAIK, GCC currently mostly ignores such directives, so they are not very useful there. Alternatively, you can use #pragma clang loop vectorize(enable) on the x loop of blur_rows1 so that Clang vectorizes it (though the generated code is not very efficient). Clang will only vectorize the code in the main function and not the non-inlined blur_rows1 function, certainly because radius is not a constant there, so the inner loop cannot be fully unrolled and vectorizing the outer loop is more complex in this case.
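For reference, a sketch of how such a vectorization hint could be attached to the x loop (Clang-specific; the pragma is a hint rather than a guarantee, and other compilers will just warn about an unknown pragma):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

const uint32_t kern[5] = {1, 4, 6, 4, 1};

// Same semantics as the original boundary-checked blur, with a hint telling
// Clang that the x loop is worth vectorizing despite the conditional.
void blur_row_hinted(std::vector<uint8_t> &dst, const std::vector<uint8_t> &src,
                     int nx, int y){
    const int radius = 2;
#pragma clang loop vectorize(enable)
    for (int x = 0; x < nx; x++){
        uint32_t sum = 0;
        for (int x2 = x - radius; x2 <= x + radius; x2++)
            if (0 <= x2 && x2 < nx)           // boundary check stays in place
                sum += kern[x2 - x + radius] * src[y * nx + x2];
        dst[y * nx + x] = sum / 16;
    }
}
```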
Besides, note that 32-bit operations are less efficient than 16-bit ones here, since SIMD units can compute twice as many lanes per cycle with 16-bit integers as with 32-bit ones on mainstream CPUs.
Higher-level languages like stencil-specialized DSLs can help to generate efficient code while supporting boundary conditions. Halide should be able to do that relatively efficiently, for example. However, as you pointed out, this solution requires learning another language (I am not aware of any compilation issue, though I have only used it once).
An alternative, more efficient solution is to simply use a SIMD-friendly language to ensure the code is (nearly) always vectorized properly. ISPC is one of them. It is similar to CUDA but mainly targets CPUs. ISPC might fit your needs since it is lower-level and close to the C programming language (with only a few C++-related extensions like basic templates). It does not solve the loop-splitting issue per se, but the presence of the boundary condition does not prevent SIMD optimizations (and so generally does not strongly reduce performance). It also helps you to optimize the code better (better than what GCC/Clang do on the provided code) while keeping it relatively readable. Here is an (untested) ISPC code using a temporary buffer with borders, specialized to compute this specific Gaussian blur:
static inline void blur(uniform uint8 dst[], const uniform uint16 src[],
                        uniform int nx, varying int x, uniform int y)
{
    varying uint16 sum = 0;
    sum += 1 * ((uint16)src[x - 2] + (uint16)src[x + 2]);
    sum += 4 * ((uint16)src[x - 1] + (uint16)src[x + 1]);
    sum += 6 * (uint16)src[x];
    dst[y * nx + x] = sum >> 4; // Fast div by 16
}

extern "C" void blur_rows1(uniform uint8 dst[], const uniform uint8 src[],
                           uniform int nx, uniform int ny)
{
    uniform uint16* uniform tmp = uniform new uniform uint16[nx + 4];
    for (uniform int x = 0; x < 2; ++x)
        tmp[x] = tmp[2 + nx + x] = 0;
    for (uniform int y = 0; y < ny; y++)
    {
        foreach (x = 0 ... nx)
            tmp[x + 2] = src[x + y * nx];
        foreach (x = 0 ... nx)
            blur(dst, tmp + 2, nx, x, y);
    }
    delete[] tmp;
}
Put shortly, uniform variables are scalar and varying ones are SIMD (see the ISPC documentation for more information). This code results in a pretty efficient SIMD-accelerated main computing loop (see the assembly on Godbolt):
.LBB0_6:
vmovdqu ymm2, ymmword ptr [r10 + 2*r11 - 62]
vmovdqu ymm3, ymmword ptr [r10 + 2*r11 - 30]
vpaddw ymm2, ymm2, ymmword ptr [r10 + 2*r11 - 66]
vpaddw ymm3, ymm3, ymmword ptr [r10 + 2*r11 - 34]
vpsllw ymm3, ymm3, 2
vpsllw ymm2, ymm2, 2
vpaddw ymm2, ymm2, ymmword ptr [r10 + 2*r11 - 60]
vpaddw ymm3, ymm3, ymmword ptr [r10 + 2*r11 - 28]
vpmullw ymm4, ymm0, ymmword ptr [r10 + 2*r11 - 64]
vpaddw ymm2, ymm2, ymm4
vpmullw ymm4, ymm0, ymmword ptr [r10 + 2*r11 - 32]
vpaddw ymm3, ymm3, ymm4
vpsrlw ymm2, ymm2, 4
vpsrlw ymm3, ymm3, 4
vpand ymm3, ymm3, ymm1
vpand ymm2, ymm2, ymm1
vpackuswb ymm2, ymm2, ymm3
vpermq ymm2, ymm2, 216 # ymm2 = ymm2[0,2,1,3]
vmovdqu ymmword ptr [r9 + r11 - 32], ymm2
vmovdqu ymm2, ymmword ptr [r10 + 2*r11 + 2]
vmovdqu ymm3, ymmword ptr [r10 + 2*r11 + 34]
vpaddw ymm2, ymm2, ymmword ptr [r10 + 2*r11 - 2]
vpaddw ymm3, ymm3, ymmword ptr [r10 + 2*r11 + 30]
vpsllw ymm3, ymm3, 2
vpsllw ymm2, ymm2, 2
vpaddw ymm2, ymm2, ymmword ptr [r10 + 2*r11 + 4]
vpaddw ymm3, ymm3, ymmword ptr [r10 + 2*r11 + 36]
vpmullw ymm4, ymm0, ymmword ptr [r10 + 2*r11]
vpmullw ymm5, ymm0, ymmword ptr [r10 + 2*r11 + 32]
vpaddw ymm2, ymm2, ymm4
vpaddw ymm3, ymm3, ymm5
vpsrlw ymm2, ymm2, 4
vpsrlw ymm3, ymm3, 4
vpand ymm3, ymm3, ymm1
vpand ymm2, ymm2, ymm1
vpackuswb ymm2, ymm2, ymm3
vpermq ymm2, ymm2, 216 # ymm2 = ymm2[0,2,1,3]
vmovdqu ymmword ptr [r9 + r11], ymm2
add r11, 64
add r12, -2
jne .LBB0_6
I expect this to be at least twice as fast as blur_rows2, while being simpler, though more specialized. Even a more generic ISPC implementation should still be significantly faster than blur_rows2.
By the way, the blur function in the ISPC code is not required. If you prefer to write one big function, that is fine too (see a minimal code on Godbolt).