Tags: c, gcc, openmp, simd, auto-vectorization

Does 256-bit vectorization via OpenMP SIMD prevent the compiler's optimizations (say, function inlining)?


Consider the following toy example, where A is an n x 2 matrix stored in column-major order and I want to compute its column sums. sum_0 only computes the sum of the 1st column, while sum_1 does the 2nd column as well. This is really an artificial example, as there is essentially no need to define two functions for this task (I could write a single function with a double loop nest where the outer loop iterates from 0 to j). It is constructed to demonstrate the template problem I have in reality.

/* "test.c" */
#include <stdlib.h>

// j can be 0 or 1
static inline void sum_template (size_t j, size_t n, double *A, double *c) {

  if (n == 0) return;
  size_t i;
  double *a = A, *b = A + n;
  double c0 = 0.0, c1 = 0.0;

  #pragma omp simd reduction (+: c0, c1) aligned (a, b: 32)
  for (i = 0; i < n; i++) {
    c0 += a[i];
    if (j > 0) c1 += b[i];
    }

  c[0] = c0;
  if (j > 0) c[1] = c1;

  }

#define macro_define_sum(FUN, j)            \
void FUN (size_t n, double *A, double *c) { \
  sum_template(j, n, A, c);                 \
  }

macro_define_sum(sum_0, 0)
macro_define_sum(sum_1, 1)

If I compile it with

gcc -O2 -mavx test.c

GCC (say the latest 8.2), after inlining, constant propagation and dead code elimination, would optimize out the code involving c1 for the function sum_0 (Check it on Godbolt).
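To make this concrete, the code GCC generates for sum_0 in that case matches what one would get from hand-writing the reduced function below (my own sketch of the effective result, not actual compiler output):

#include <stdlib.h>

/* Sketch: what sum_0 effectively reduces to after inlining,
   constant propagation of j = 0 and dead code elimination. */
void sum_0_reduced (size_t n, double *A, double *c) {
  if (n == 0) return;
  size_t i;
  double c0 = 0.0;
  for (i = 0; i < n; i++) c0 += A[i];
  c[0] = c0;
  }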

I like this trick. By writing a single template function and passing in different configuration parameters, an optimizing compiler can generate different versions. It is much cleaner than copying and pasting a big proportion of the code and manually defining different function versions.

However, such convenience is lost if I activate OpenMP 4.0+ with

gcc -O2 -mavx -fopenmp test.c

sum_template is no longer inlined and no dead code elimination is applied (Check it on Godbolt). But if I remove the flag -mavx to work with 128-bit SIMD, compiler optimization works as I expect (Check it on Godbolt). So is this a bug? I am on x86-64 (Sandy Bridge).


Remark

Using GCC's auto-vectorization with -ftree-vectorize -ffast-math does not have this issue (Check it on Godbolt). But I wish to use OpenMP because it provides a portable alignment pragma across different compilers.
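For comparison, the compiler-specific way to make the same alignment promise in GCC and Clang is __builtin_assume_aligned; a sketch (the function name sum_col0_assume is mine):

#include <stdlib.h>

/* Sketch: GCC/Clang-specific alignment hint instead of the OpenMP
   aligned clause; not portable to every compiler. */
double sum_col0_assume (size_t n, const double *A) {
  const double *a = __builtin_assume_aligned(A, 32);   /* promise 32-byte alignment */
  double c0 = 0.0;
  for (size_t i = 0; i < n; i++) c0 += a[i];
  return c0;
  }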

Background

I write modules for an R package, which needs to be portable across platforms and compilers. Writing an R extension requires no Makefile. When R is built on a platform, it knows what the default compiler is on that platform and configures a set of default compilation flags. R does not have an auto-vectorization flag, but it does have an OpenMP flag. This means that using OpenMP SIMD is the ideal way to utilize SIMD in an R package. See 1 and 2 for a bit more elaboration.


Solution

  • I desperately needed to resolve this issue, because in my real C project, if no template trick were used for automatic generation of different function versions ("versioning" hereafter), I would need to write a total of 1400 lines of code for 9 different versions, instead of just 200 lines for a single template.

    I was able to find a way out, and am now posting a solution using the toy example in the question.


    I planned to use the inline function sum_template for versioning. If it worked, versioning would happen at compile time, when the compiler performs its optimizations. However, the OpenMP pragma turns out to defeat this compile-time versioning. The remaining option is to do the versioning at the pre-processing stage, using macros only.

    To get rid of the inline function sum_template, I manually inline it in the macro macro_define_sum:

    #include <stdlib.h>
    
    // j can be 0 or 1
    #define macro_define_sum(FUN, j)                            \
    void FUN (size_t n, double *A, double *c) {                 \
      if (n == 0) return;                                       \
      size_t i;                                                 \
      double *a = A, * b = A + n;                               \
      double c0 = 0.0, c1 = 0.0;                                \
      #pragma omp simd reduction (+: c0, c1) aligned (a, b: 32) \
      for (i = 0; i < n; i++) {                                 \
        c0 += a[i];                                             \
        if (j > 0) c1 += b[i];                                  \
        }                                                       \
      c[0] = c0;                                                \
      if (j > 0) c[1] = c1;                                     \
      }
    
    macro_define_sum(sum_0, 0)
    macro_define_sum(sum_1, 1)
    

    In this macro-only version, j is directly substituted by 0 or 1 during macro expansion. In the inline function + macro approach from the question, by contrast, the preprocessor only produces the calls sum_template(0, n, A, c) and sum_template(1, n, A, c), and the value of j is only propagated into the body of sum_template later, at compile time.
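    To make the difference concrete, here is roughly what macro_define_sum(sum_0, 0) would have to expand to (a hand-written sketch, ignoring for a moment the pragma problem discussed next); the branch on the literal 0 is trivially dead:

    /* Sketch of the expansion of macro_define_sum(sum_0, 0); the pragma
       would have to survive the expansion right before the loop. */
    void sum_0 (size_t n, double *A, double *c) {
      if (n == 0) return;
      size_t i;
      double *a = A, *b = A + n;
      double c0 = 0.0, c1 = 0.0;
      for (i = 0; i < n; i++) {
        c0 += a[i];
        if (0 > 0) c1 += b[i];   /* j has already become the literal 0 */
        }
      c[0] = c0;
      if (0 > 0) c[1] = c1;
      }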

    Unfortunately, this macro-only version gives an error: I cannot put a preprocessing directive such as #pragma inside a macro (see 1, 2, 3). The OpenMP pragma, starting with #, is the problem here. So I have to split the template into two parts: the part before the pragma and the part after it.

    #include <stdlib.h>
    
    #define macro_before_pragma   \
      if (n == 0) return;         \
      size_t i;                   \
      double *a = A, * b = A + n; \
      double c0 = 0.0, c1 = 0.0;
    
    #define macro_after_pragma(j) \
      for (i = 0; i < n; i++) {   \
        c0 += a[i];               \
        if (j > 0) c1 += b[i];    \
        }                         \
      c[0] = c0;                  \
      if (j > 0) c[1] = c1;
    
    void sum_0 (size_t n, double *A, double *c) {
      macro_before_pragma
      #pragma omp simd reduction (+: c0) aligned (a: 32)
      macro_after_pragma(0)
      }
    
    void sum_1 (size_t n, double *A, double *c) {
      macro_before_pragma
      #pragma omp simd reduction (+: c0, c1) aligned (a, b: 32)
      macro_after_pragma(1)
      }
    

    I no longer need macro_define_sum; I can define sum_0 and sum_1 straight away from the two macros, adjusting the pragma in each function as appropriate. Instead of a template for a whole function, I now have templates for code blocks of a function and can reuse them with ease.

    The compiler output is as expected in this case (Check it on Godbolt).


    Update

    Thanks for the various feedback; it is all very constructive (this is why I love Stack Overflow).

    Thanks to Marc Glisse for pointing me to Using an openmp pragma inside #define. Yeah, it was my bad not to have searched for this issue. #pragma is a directive, not a real macro, so there must be some way to put one inside a macro. Here is the neat version using the _Pragma operator:

    /* "neat.c" */
    #include <stdlib.h>
    
    // stringizing: https://gcc.gnu.org/onlinedocs/cpp/Stringizing.html
    #define str(s) #s
    
    // j can be 0 or 1
    #define macro_define_sum(j, alignment)                                   \
    void sum_ ## j (size_t n, double *A, double *c) {                        \
      if (n == 0) return;                                                    \
      size_t i;                                                              \
      double *a = A, * b = A + n;                                            \
      double c0 = 0.0, c1 = 0.0;                                             \
      _Pragma(str(omp simd reduction (+: c0, c1) aligned (a, b: alignment))) \
      for (i = 0; i < n; i++) {                                              \
        c0 += a[i];                                                          \
        if (j > 0) c1 += b[i];                                               \
        }                                                                    \
      c[0] = c0;                                                             \
      if (j > 0) c[1] = c1;                                                  \
      }
    
    macro_define_sum(0, 32)
    macro_define_sum(1, 32)
    

    Other changes include:

    • I used token concatenation to generate the function name;
    • alignment is made a macro argument. For AVX, a value of 32 means good alignment, while a value of 8 (sizeof(double)) essentially implies no alignment. Stringizing is required to turn those tokens into the string literal that _Pragma requires (see the expansion sketch below).
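    For instance, with alignment substituted by 32, the stringized _Pragma line inside macro_define_sum(0, 32) expands to roughly the following, which the compiler treats like the corresponding #pragma:

    /* Roughly the result of expanding the _Pragma line with alignment = 32;
       it behaves like: #pragma omp simd reduction (+: c0, c1) aligned (a, b: 32) */
    _Pragma("omp simd reduction (+: c0, c1) aligned (a, b: 32)")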

    Use gcc -E neat.c to inspect the pre-processing result. Compilation gives the desired assembly output (Check it on Godbolt).


    A few comments on Peter Cordes' informative answer

    Using compiler function attributes. I am not a professional C programmer; my experience with C comes merely from writing R extensions. That development environment means I am not very familiar with compiler attributes: I know some, but don't really use them.
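    For what it's worth, a GCC function attribute looks like this; always_inline, for example, asks for inlining regardless of the usual heuristics (a sketch only, not something I currently use):

    #include <stdlib.h>

    /* Sketch: a GCC function attribute requesting that sum_template be inlined
       even when the inlining heuristics would otherwise refuse. */
    static inline void sum_template (size_t j, size_t n, double *A, double *c)
      __attribute__ ((always_inline));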

    -mavx256-split-unaligned-load is not an issue in my application, because I allocate aligned memory and apply padding to ensure alignment. I just need to promise the compiler the alignment so that it can generate aligned load / store instructions. I do need to vectorize some unaligned data, but that contributes only a very limited part of the whole computation, so even if I got a performance penalty from split unaligned loads it would not be noticeable in practice. I also don't compile every C file with auto-vectorization; I only use SIMD when the operation is hot in L1 cache (i.e., it is CPU-bound, not memory-bound). By the way, -mavx256-split-unaligned-load is a GCC option; what is the equivalent for other compilers?
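    Concretely, the aligned allocation with padding that I have in mind looks roughly like this (a sketch using C11 aligned_alloc; the helper name alloc_col is mine):

    #include <stdlib.h>

    /* Sketch: allocate one column of n doubles, 32-byte aligned, padding the
       length up to a multiple of 4 doubles (one 256-bit vector).
       aligned_alloc requires the size to be a multiple of the alignment. */
    double *alloc_col (size_t n) {
      size_t padded = (n + 3) & ~(size_t) 3;   /* round up to a multiple of 4 */
      return aligned_alloc(32, padded * sizeof(double));
      }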

    I am aware of the difference between static inline and inline. If an inline function is only accessed within one file, I declare it static so that the compiler does not emit a stand-alone copy of it.

    OpenMP SIMD can do the reduction efficiently even without GCC's -ffast-math. However, it does not use horizontal addition to aggregate the partial results inside the accumulator register at the end of the reduction; it runs a scalar loop that adds up each double (see code blocks .L5 and .L27 in the Godbolt output).
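    For reference, the horizontal addition I have in mind would look roughly like this with AVX intrinsics (a sketch of the idea, not what GCC emits here):

    #include <immintrin.h>

    /* Sketch: horizontal sum of a 256-bit accumulator holding 4 doubles. */
    static inline double hsum256_pd (__m256d v) {
      __m128d lo = _mm256_castpd256_pd128(v);       /* lower two doubles    */
      __m128d hi = _mm256_extractf128_pd(v, 1);     /* upper two doubles    */
      lo = _mm_add_pd(lo, hi);                      /* pairwise add         */
      __m128d swap = _mm_unpackhi_pd(lo, lo);       /* move high lane down  */
      return _mm_cvtsd_f64(_mm_add_sd(lo, swap));   /* final scalar add     */
      }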

    Throughput is a good point (especially for floating-point arithmetic, which has relatively high latency but high throughput). My real C code where SIMD is applied is a triple loop nest; I unroll the outer two loops to enlarge the code block in the innermost loop and enhance throughput, so vectorizing the innermost loop is then sufficient. With the toy example in this Q & A, where I just sum an array, I can use -funroll-loops to ask GCC for loop unrolling, which uses several accumulators to enhance throughput.
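    In the toy example, manual unrolling with several accumulators would look roughly like this (a sketch; in practice I would let -funroll-loops or the outer-loop unrolling do it):

    #include <stdlib.h>

    /* Sketch: four independent accumulators let the floating-point adds
       overlap instead of forming one long dependency chain. */
    double sum_unrolled (size_t n, const double *a) {
      double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
      size_t i = 0;
      for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
        }
      for (; i < n; i++) s0 += a[i];   /* remainder */
      return (s0 + s1) + (s2 + s3);
      }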


    On this Q & A

    I think most people would treat this Q & A in a more technical way than I do. They might be interested in using compiler attributes or tweaking compiler flags / parameters to force function inlining. Therefore, Peter's answer, as well as Marc's comment under it, is still very valuable. Thanks again.