I am looking for a more ergonomic way to do loop splitting in C++.
Imagine we want to blur all the rows of an image by convolving each row with the well-known {1, 4, 6, 4, 1}
filter kernel. The first and last two pixels of each row have to be handled separately, because otherwise we would access out-of-bounds pixels.
Unfortunately, those boundary checks inhibit compiler optimization, so instead of writing a single loop with boundary checks (see blur_rows1 below for reference), we have to split the loop into three separate loops (see function blur_rows2):
// Blur row with loop splitting.
for (int x = 0; x < radius; x++) blur(dst, src, nx, x, y, kernel, radius);
for (int x = radius; x < nx - radius; x++) blur(dst, src, nx, x, y, kernel, radius);
for (int x = nx - radius; x < nx; x++) blur(dst, src, nx, x, y, kernel, radius);
This makes blur_rows2 roughly 10 times faster than blur_rows1 (50 µs vs 500 µs for a 512x512 image).
There are multiple ways to implement loop splitting, which all have various flaws:

- Write the three loops by hand, as above, which is verbose and error-prone.
- Move the loop body into a helper function or lambda and force inlining (e.g. __attribute__((always_inline)) with GCC).
- Put the loop body in a macro, which requires a \ after every line.
- Pad src with boundary fill values so the boundary check is not required.
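For illustration, the forced-inlining variant could be sketched like this (the split_loop helper and its name are hypothetical, and the attribute assumes GCC/Clang):

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: the loop body goes into a lambda and a helper
// generates the three split loops. always_inline ensures the lambda does
// not survive as an opaque function call, so the compiler can specialize
// the interior loop and drop the boundary check there.
template <class F>
__attribute__((always_inline)) inline
void split_loop(int lo, int mid1, int mid2, int hi, F &&body){
    for (int x = lo;   x < mid1; x++) body(x); // left boundary
    for (int x = mid1; x < mid2; x++) body(x); // interior, branch-free after inlining
    for (int x = mid2; x < hi;   x++) body(x); // right boundary
}
```

The downside is that every loop body has to be routed through a lambda, and the attribute is non-standard.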
It would be nice if there were compiler support for something like #pragma loop split (0, radius, nx - radius, nx) or, even better, some way to tell the compiler that it should split the loop such that the condition if (0 <= x2 && x2 < nx) goes away.
Is there something like this? I feel like it might be possible with some clever tricks like Duff's device or template magic. Solutions that only work with GCC or Clang are fine.
#include <cstdlib>
#include <cstdint>
#include <vector>
#include <chrono>
#include <iostream>

using std::uint8_t;

const uint32_t kernel[5] = {1, 4, 6, 4, 1};
const int radius = 2;

// Blur the pixel at (x, y) from src image of size (nx, ny) and write to dst image of same size
static inline void blur(std::vector<uint8_t> &dst, const std::vector<uint8_t> &src, int nx, int x, int y, const uint32_t *kernel, int radius){
    uint32_t sum = 0;
    for (int x2 = x - radius; x2 <= x + radius; x2++){
        // Only sum pixels within image boundaries.
        // The compiler is smart enough to remove the branch, but only when doing manual loop splitting.
        if (0 <= x2 && x2 < nx){
            sum += kernel[x2 - x + radius] * src[y * nx + x2];
        }
    }
    dst[y * nx + x] = sum / 16;
}

void blur_rows1(std::vector<uint8_t> &dst, const std::vector<uint8_t> &src, int nx, int ny){
    for (int y = 0; y < ny; y++){
        // Blur row without loop splitting.
        for (int x = 0; x < nx; x++){
            blur(dst, src, nx, x, y, kernel, radius);
        }
    }
}

void blur_rows2(std::vector<uint8_t> &dst, const std::vector<uint8_t> &src, int nx, int ny){
    for (int y = 0; y < ny; y++){
        // Blur row with loop splitting.
        for (int x = 0; x < radius; x++) blur(dst, src, nx, x, y, kernel, radius);
        for (int x = radius; x < nx - radius; x++) blur(dst, src, nx, x, y, kernel, radius);
        for (int x = nx - radius; x < nx; x++) blur(dst, src, nx, x, y, kernel, radius);
    }
}

double sec(){
    using namespace std::chrono;
    return duration_cast<duration<double>>(steady_clock::now().time_since_epoch()).count();
}

int main(){
    int nx = 512;
    int ny = 512;
    std::vector<uint8_t> src(nx * ny);
    std::vector<uint8_t> dst1(nx * ny);
    std::vector<uint8_t> dst2(nx * ny);
    // Initialize src image.
    for (int i = 0; i < nx * ny; i++){
        src[i] = std::rand() & 0xff;
    }
    // Run benchmark 10 times.
    for (int i_run = 0; i_run < 10; i_run++){
        double t0 = sec();
        blur_rows1(dst1, src, nx, ny);
        double t1 = sec();
        blur_rows2(dst2, src, nx, ny);
        double t2 = sec();
        std::cout << "blur_rows1: " << (t1 - t0) * 1e6 << " µs" << std::endl;
        std::cout << "blur_rows2: " << (t2 - t1) * 1e6 << " µs" << std::endl;
        std::cout << std::endl;
        // Check that the result is correct.
        for (int i = 0; i < nx * ny; i++){
            if (dst1[i] != dst2[i]) std::exit(1);
        }
    }
    return 0;
}
> It would be nice if there was compiler support for something like #pragma loop split (0, radius, nx - radius, nx)
AFAIK, there is no such thing in GCC/Clang (for C++). Note that loop splitting is implemented in both GCC and Clang, but it is brittle and does not seem to be able to optimize your code. Here is a basic example where we can see that loop splitting is applied by both GCC and Clang. Note that there is a pragma to control this kind of transformation, but I think it is not powerful enough to do what you want: see #pragma clang loop distribute(enable).
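As a minimal illustration, the pragma is attached directly to the loop it should affect (the function below is made up for the example, and the pragma is only a hint that the compiler may ignore; other compilers will just warn about an unknown pragma):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Clang-specific hint asking the compiler to distribute this loop into
// two independent loops (one per statement). It is a hint, not a command.
void saxpy_and_copy(std::vector<float> &a, std::vector<float> &b,
                    const std::vector<float> &c, float k){
#pragma clang loop distribute(enable)
    for (std::size_t i = 0; i < a.size(); i++){
        a[i] = k * c[i];     // would become the first loop
        b[i] = a[i] + c[i];  // would become the second loop
    }
}
```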
In HPC codes, a common solution is to operate on a bigger 2D array (as proposed by PepijnKramer in the comments) so as to avoid any expensive boundary condition. You can fill the borders with values like zeros, repeated values, average values, etc., depending on what you exactly want. Your current code makes the blur fade to black at the borders, which is a bit unusual (as opposed to using repeated values or the average). A copy can be done if the input cannot be directly created with such borders.
> Increases memory requirements.

Here, you can copy the input line by line into a preallocated temporary buffer reused for all lines (very low memory usage, and cache-friendly since the data can fit in the L1 cache).
> Increases memory bandwidth cost.

Not that of the DRAM (which is often the main bottleneck) nor of the L3 cache, unless lines are really huge. A line of any reasonable size should fit in the L1 or L2 cache of all mainstream CPUs. These caches are very fast (fast enough that the computation should still be compute-bound).
Actually, you can copy the line AND convert items from uint8_t to uint32_t (or even uint16_t, see later) so as to avoid doing 5 conversions per item during the blur. This makes the blur faster, while the copy/conversion speed should be mostly unaffected thanks to instruction-level parallelism.
> This makes blur_rows2 roughly 10 times faster than blur_rows1 (50 µs vs 500 µs for a 512x512 image).
OK, but 50 µs for this is actually not so fast. Indeed, this means 512*512/50e-6/3.5e9 ~= 1.5 items/cycle at 3.5 GHz.
The first function is inefficient mainly because the code is not vectorized. The code in main is vectorized by GCC but not by Clang, and neither GCC nor Clang vectorizes the two functions when they are not inlined (https://godbolt.org/z/YrzGWcGxE). This comes down to inlining, constant propagation (especially of radius) and pointer aliasing analysis (src vs dst). The compiler does not seem to optimize the code for this specific kernel and seems to perform SIMD packing/unpacking, so it is likely sub-optimal.
You can help Clang to generate SIMD code (despite the conditional) with OpenMP, more specifically with the omp declare simd (with some additional clauses) and omp simd directives. Compilers like ICC/ICX and MSVC also support this. AFAIK, GCC currently mostly ignores such directives, so they are not very useful there. Alternatively, you can use #pragma clang loop vectorize(enable) on the x loop of blur_rows1 so that Clang vectorizes it (though the generated code is not very efficient). Clang will only vectorize the code in the main function and not the non-inlined blur_rows1 function, certainly because radius is not a constant there, so the inner loop cannot be fully unrolled and vectorizing the outer loop is more complex in this case.
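For reference, a sketch of how such a vectorization hint could be attached to the x loop (Clang-specific; the pragma is a hint rather than a guarantee, and other compilers will just warn about an unknown pragma):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

const uint32_t kern[5] = {1, 4, 6, 4, 1};

// Same semantics as the original boundary-checked blur, with a hint telling
// Clang that the x loop is worth vectorizing despite the conditional.
void blur_row_hinted(std::vector<uint8_t> &dst, const std::vector<uint8_t> &src,
                     int nx, int y){
    const int radius = 2;
#pragma clang loop vectorize(enable)
    for (int x = 0; x < nx; x++){
        uint32_t sum = 0;
        for (int x2 = x - radius; x2 <= x + radius; x2++)
            if (0 <= x2 && x2 < nx)           // boundary check stays in place
                sum += kern[x2 - x + radius] * src[y * nx + x2];
        dst[y * nx + x] = sum / 16;
    }
}
```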
Besides, note that 32-bit operations are less efficient than 16-bit ones here, since SIMD units can compute twice as many lanes per cycle with 16-bit integers as with 32-bit ones on mainstream CPUs.
Higher-level languages like stencil-specialized DSLs can help to generate efficient code while supporting boundary conditions. Halide should be able to do that relatively efficiently, for example. However, as you pointed out, this solution requires learning another language (I am not aware of any compilation issue, though I have only used it once).
An alternative, more efficient solution is to simply use a SIMD-friendly language to ensure the code is (nearly) always vectorized properly. ISPC is one of them. It is similar to CUDA but mainly targets CPUs. ISPC might fit your needs since it is lower-level and close to the C programming language (with only a few C++-related extensions like basic templates). It does not solve the loop-splitting issue per se, but the presence of the boundary condition does not prevent SIMD optimizations (and so generally does not strongly reduce performance). It also helps you to optimize the code better (better than what GCC/Clang do on the provided code) while keeping it relatively readable. Here is an (untested) ISPC code using a temporary buffer with borders, specialized to compute this specific Gaussian blur:
static inline void blur(uniform uint8 dst[], const uniform uint16 src[],
                        uniform int nx, varying int x, uniform int y)
{
    varying uint16 sum = 0;
    sum += 1 * ((uint16)src[x - 2] + (uint16)src[x + 2]);
    sum += 4 * ((uint16)src[x - 1] + (uint16)src[x + 1]);
    sum += 6 * (uint16)src[x];
    dst[y * nx + x] = sum >> 4; // Fast div by 16
}

extern "C" void blur_rows1(uniform uint8 dst[], const uniform uint8 src[],
                           uniform int nx, uniform int ny)
{
    uniform uint16* uniform tmp = uniform new uniform uint16[nx + 4];
    for (uniform int x = 0; x < 2; ++x)
        tmp[x] = tmp[2 + nx + x] = 0;
    for (uniform int y = 0; y < ny; y++)
    {
        foreach (x = 0 ... nx)
            tmp[x + 2] = src[x + y * nx];
        foreach (x = 0 ... nx)
            blur(dst, tmp + 2, nx, x, y);
    }
    delete[] tmp;
}
Put shortly, uniform variables are scalar and varying ones are SIMD (see the ISPC documentation for more information). This code results in a pretty efficient SIMD-accelerated main computing loop (see the assembly on Godbolt):
.LBB0_6:
vmovdqu ymm2, ymmword ptr [r10 + 2*r11 - 62]
vmovdqu ymm3, ymmword ptr [r10 + 2*r11 - 30]
vpaddw ymm2, ymm2, ymmword ptr [r10 + 2*r11 - 66]
vpaddw ymm3, ymm3, ymmword ptr [r10 + 2*r11 - 34]
vpsllw ymm3, ymm3, 2
vpsllw ymm2, ymm2, 2
vpaddw ymm2, ymm2, ymmword ptr [r10 + 2*r11 - 60]
vpaddw ymm3, ymm3, ymmword ptr [r10 + 2*r11 - 28]
vpmullw ymm4, ymm0, ymmword ptr [r10 + 2*r11 - 64]
vpaddw ymm2, ymm2, ymm4
vpmullw ymm4, ymm0, ymmword ptr [r10 + 2*r11 - 32]
vpaddw ymm3, ymm3, ymm4
vpsrlw ymm2, ymm2, 4
vpsrlw ymm3, ymm3, 4
vpand ymm3, ymm3, ymm1
vpand ymm2, ymm2, ymm1
vpackuswb ymm2, ymm2, ymm3
vpermq ymm2, ymm2, 216 # ymm2 = ymm2[0,2,1,3]
vmovdqu ymmword ptr [r9 + r11 - 32], ymm2
vmovdqu ymm2, ymmword ptr [r10 + 2*r11 + 2]
vmovdqu ymm3, ymmword ptr [r10 + 2*r11 + 34]
vpaddw ymm2, ymm2, ymmword ptr [r10 + 2*r11 - 2]
vpaddw ymm3, ymm3, ymmword ptr [r10 + 2*r11 + 30]
vpsllw ymm3, ymm3, 2
vpsllw ymm2, ymm2, 2
vpaddw ymm2, ymm2, ymmword ptr [r10 + 2*r11 + 4]
vpaddw ymm3, ymm3, ymmword ptr [r10 + 2*r11 + 36]
vpmullw ymm4, ymm0, ymmword ptr [r10 + 2*r11]
vpmullw ymm5, ymm0, ymmword ptr [r10 + 2*r11 + 32]
vpaddw ymm2, ymm2, ymm4
vpaddw ymm3, ymm3, ymm5
vpsrlw ymm2, ymm2, 4
vpsrlw ymm3, ymm3, 4
vpand ymm3, ymm3, ymm1
vpand ymm2, ymm2, ymm1
vpackuswb ymm2, ymm2, ymm3
vpermq ymm2, ymm2, 216 # ymm2 = ymm2[0,2,1,3]
vmovdqu ymmword ptr [r9 + r11], ymm2
add r11, 64
add r12, -2
jne .LBB0_6
I expect this to be at least twice as fast as blur_rows2, while being simpler, though more specialized. Even a more generic ISPC implementation should still be significantly faster than blur_rows2.
By the way, the blur function in the ISPC code is not required. If you prefer to write one big function, that is fine too (see a minimal code on Godbolt).