Consider the following toy example, where A
is an n x 2
matrix stored in column-major order and I want to compute its column sum. sum_0
only computes sum of the 1st column, while sum_1
does the 2nd column as well. This is really an artificial example, as there is essentially no need to define two functions for this task (I can write a single function with a double loop nest where the outer loop iterates from 0
to j
). It is constructed to demonstrate the template problem I have in reality.
/* "test.c" */
#include <stdlib.h>
// j can be 0 or 1
static inline void sum_template (size_t j, size_t n, double *A, double *c) {
if (n == 0) return;
size_t i;
double *a = A, *b = A + n;
double c0 = 0.0, c1 = 0.0;
#pragma omp simd reduction (+: c0, c1) aligned (a, b: 32)
for (i = 0; i < n; i++) {
c0 += a[i];
if (j > 0) c1 += b[i];
}
c[0] = c0;
if (j > 0) c[1] = c1;
}
#define macro_define_sum(FUN, j) \
void FUN (size_t n, double *A, double *c) { \
sum_template(j, n, A, c); \
}
macro_define_sum(sum_0, 0)
macro_define_sum(sum_1, 1)
If I compile it with
gcc -O2 -mavx test.c
GCC (say the latest 8.2), after inlining, constant propagation and dead code elimination, would optimize out code involving c1
for function sum_0
(Check it on Godbolt).
I like this trick. By writing a single template function and passing in different configuration parameters, an optimizing compiler can generate different versions. It is much cleaner than copying-and-pasting a big proportion of the code and manually define different function versions.
However, such convenience is lost if I activate OpenMP 4.0+ with
gcc -O2 -mavx -fopenmp test.c
sum_template
is inlined no more and no dead code elimination is applied (Check it on Godbolt). But if I remove flag -mavx
to work with 128-bit SIMD, compiler optimization works as I expect (Check it on Godbolt). So is this a bug? I am on an x86-64 (Sandybridge).
Remark
Using GCC's auto-vectorization -ftree-vectorize -ffast-math
would not have this issue (Check it on Godbolt). But I wish to use OpenMP because it allows portable alignment pragma across different compilers.
Background
I write modules for an R package, which needs be portable across platforms and compilers. Writing R extension requires no Makefile. When R is built on a platform, it knows what the default compiler is on that platform, and configures a set of default compilation flags. R does not have auto-vectorization flag but it has OpenMP flag. This means that using OpenMP SIMD is the ideal way to utilize SIMD in an R package. See 1 and 2 for a bit more elaboration.
I desperately needed to resolve this issue, because in my real C project, if no template trick were used for auto generation of different function versions (simply called "versioning" hereafter), I would need to write a total of 1400 lines of code for 9 different versions, instead of just 200 lines for a single template.
I was able to find a way out, and am now posting a solution using the toy example in the question.
I planed to utilize an inline function sum_template
for versioning. If successful, it occurs at compile time when a compiler performs optimization. However, OpenMP pragma turns out to fail this compile time versioning. The option is then to do versioning at the pre-processing stage using macros only.
To get rid of the inline function sum_template
, I manually inline it in the macro macro_define_sum
:
#include <stdlib.h>
// j can be 0 or 1
#define macro_define_sum(FUN, j) \
void FUN (size_t n, double *A, double *c) { \
if (n == 0) return; \
size_t i; \
double *a = A, * b = A + n; \
double c0 = 0.0, c1 = 0.0; \
#pragma omp simd reduction (+: c0, c1) aligned (a, b: 32) \
for (i = 0; i < n; i++) { \
c0 += a[i]; \
if (j > 0) c1 += b[i]; \
} \
c[0] = c0; \
if (j > 0) c[1] = c1; \
}
macro_define_sum(sum_0, 0)
macro_define_sum(sum_1, 1)
In this macro-only version, j
is directly substituted by 0 or 1 at during macro expansion. Whereas in the inline function + macro approach in the question, I only have sum_template(0, n, a, b, c)
or sum_template(1, n, a, b, c)
at pre-processing stage, and j
in the body of sum_template
is only propagated at the later compile time.
Unfortunately, the above macro gives error. I can not define or test a macro inside another (see 1, 2, 3). The OpenMP pragma starting with #
is causing problem here. So I have to split this template into two parts: the part before the pragma and the part after.
#include <stdlib.h>
#define macro_before_pragma \
if (n == 0) return; \
size_t i; \
double *a = A, * b = A + n; \
double c0 = 0.0, c1 = 0.0;
#define macro_after_pragma(j) \
for (i = 0; i < n; i++) { \
c0 += a[i]; \
if (j > 0) c1 += b[i]; \
} \
c[0] = c0; \
if (j > 0) c[1] = c1;
void sum_0 (size_t n, double *A, double *c) {
macro_before_pragma
#pragma omp simd reduction (+: c0) aligned (a: 32)
macro_after_pragma(0)
}
void sum_1 (size_t n, double *A, double *c) {
macro_before_pragma
#pragma omp simd reduction (+: c0, c1) aligned (a, b: 32)
macro_after_pragma(1)
}
I no long need macro_define_sum
. I can define sum_0
and sum_1
straightaway using the defined two macros. I can also adjust the pragma appropriately. Here instead of having a template function, I have templates for code blocks of a function and can reuse them with ease.
The compiler output is as expected in this case (Check it on Godbolt).
Thanks for the various feedback; they are all very constructive (this is why I love Stack Overflow).
Thanks Marc Glisse for point me to Using an openmp pragma inside #define. Yeah, it was my bad to not have searched this issue. #pragma
is an directive, not a real macro, so there must be some way to put it inside a macro. Here is the neat version using the _Pragma
operator:
/* "neat.c" */
#include <stdlib.h>
// stringizing: https://gcc.gnu.org/onlinedocs/cpp/Stringizing.html
#define str(s) #s
// j can be 0 or 1
#define macro_define_sum(j, alignment) \
void sum_ ## j (size_t n, double *A, double *c) { \
if (n == 0) return; \
size_t i; \
double *a = A, * b = A + n; \
double c0 = 0.0, c1 = 0.0; \
_Pragma(str(omp simd reduction (+: c0, c1) aligned (a, b: alignment))) \
for (i = 0; i < n; i++) { \
c0 += a[i]; \
if (j > 0) c1 += b[i]; \
} \
c[0] = c0; \
if (j > 0) c[1] = c1; \
}
macro_define_sum(0, 32)
macro_define_sum(1, 32)
Other changes include:
alignment
is made a macro argument. For AVX, a value of 32 means good alignment, while a value of 8 (sizeof(double)
) essentially implies no alignment. Stringizing is required to parse those tokens into strings that _Pragma
requires.Use gcc -E neat.c
to inspect pre-processing result. Compilation gives desired assembly output (Check it on Godbolt).
Using complier's function attributes. I am not a professional C programmer. My experiences with C come merely from writing R extensions. The development environment determines that I am not very familiar with compiler attributes. I know some, but don't really use them.
-mavx256-split-unaligned-load
is not an issue in my application, because I will allocate aligned memory and apply padding to ensure alignment. I just need to promise compiler of the alignment so that it can generate aligned load / store instructions. I do need to do some vectorization on unaligned data, but that contributes to a very limited part of the whole computation. Even if I get a performance penalty on split unaligned load it won't be noticed in reality. I also don't compiler every C file with auto vectorization. I only do SIMD when the operation is hot on L1 cache (i.e., it is CPU-bound not memory-bound). By the way, -mavx256-split-unaligned-load
is for GCC; what is it for other compilers?
I am aware of the difference between static inline
and inline
. If an inline
function is only accessed by one file, I will declare it as static
so that compiler does not generate a copy of it.
OpenMP SIMD can do reduction efficiently even without GCC's -ffast-math
. However, it does not use horizontal addition to aggregate results inside the accumulator register in the end of the reduction; it runs a scalar loop to add up each double word (see code block .L5 and .L27 in Godbolt output).
Throughput is a good point (especially for floating-point arithmetics which has relatively big latency but high throughput). My real C code where SIMD is applied is a triple loop nest. I unroll outer two loops to enlarge the code block in the innermost loop to enhance throughput. Vectorization of the innermost one is then sufficient. With the toy example in this Q & A where I just sum an array, I can use -funroll-loops
to ask GCC for loop unrolling, using several accumulators to enhance throughput.
I think most people would treat this Q & A in a more technical way than me. They might be interested in using compiler attributes or tweaking compiler flags / parameters to force function inlining. Therefore, Peter's answer as well as Marc's comment under the answer is still very valuable. Thanks again.