How can you get gcc to fully vectorize this sqrt loop?

If I take this code

#include <cmath>

void compute_sqrt(const double* x, double* y, int n) {
  int i;
#pragma omp simd linear(i)
  for (i=0; i<n; ++i) {
    y[i] = std::sqrt(x[i]);
  }
}

and compile with g++ -S -c -O3 -fopenmp-simd -march=cascadelake, then I get instructions like this in the loop (compiler-explorer)

...
  vsqrtsd %xmm0, %xmm0, %xmm0
...

XMMs are 128-bit registers, but cascadelake supports AVX-512. Is there a way to get gcc to use 256-bit (YMM) or 512-bit (ZMM) registers?

By comparison, ICC defaults to 256-bit registers for cascadelake. Compiling with icc -c -S -O3 -march=cascadelake -qopenmp-simd produces (compiler-explorer)

...
  vsqrtpd 32(%rdi,%r9,8), %ymm1 #7.12
...

and you can add the option -qopt-zmm-usage=high to use 512-bit registers (compiler-explorer)

...
  vrsqrt14pd %zmm4, %zmm1 #7.12
...

Solution

XMMs are 128 bit registers

It's worse than that: vsqrtsd is not even a vector operation, as indicated by the sd suffix (scalar, double precision). XMM registers are also used by scalar floating-point operations like this one, but then only the low 64 bits (double) or 32 bits (float) of the register hold the value being operated on; the upper bits take no part in the computation.

The missing options are -fno-math-errno (this flag is also implied by -ffast-math, which has additional effects) and (optionally) -mprefer-vector-width=512.

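-fno-math-errno turns off setting errno for math functions. For square roots in particular, this means a negative input simply results in NaN without errno being set to EDOM. Without the flag, GCC has to keep a fallback path that calls the library sqrt for negative inputs so that errno gets set, and that call keeps the loop from being vectorized. ICC apparently does not care about errno by default.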
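To make the errno requirement concrete, here is a hand-written model of what the default (errno-setting) semantics ask for on each element. The function name sqrt_with_errno is hypothetical and this is not what GCC literally emits; it is only a sketch of the per-element branch that gets in the way of a packed square root.

```cpp
#include <cerrno>
#include <cmath>
#include <limits>

// Hypothetical model of sqrt under errno-setting semantics:
// a negative input must report a domain error through errno
// in addition to producing NaN.
double sqrt_with_errno(double v) {
  if (v < 0.0) {
    errno = EDOM;  // side effect that a packed vsqrtpd cannot perform
    return std::numeric_limits<double>::quiet_NaN();
  }
  return std::sqrt(v);  // non-negative, -0.0 and NaN inputs need no errno update
}
```

With -fno-math-errno the branch and its side effect are no longer required, so each element reduces to a plain square root that can be done several lanes at a time.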

-mprefer-vector-width=512 makes autovectorization prefer 512-bit operations when they make sense. By default, 256-bit operations are preferred, at least for cascadelake, skylake-avx512, and other current processors. That probably won't stay the case for all future processors.
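Putting both options together, a sketch of the loop from the question with the full command line in a comment. The only change to the code is that the loop variable is declared in the for statement and the linear(i) clause is dropped, which is a simplification on my part and not required for vectorization.

```cpp
#include <cmath>

// Compile, for example, with:
//   g++ -S -O3 -fopenmp-simd -march=cascadelake \
//       -fno-math-errno -mprefer-vector-width=512 file.cpp
// Leave out -mprefer-vector-width=512 to get 256-bit (YMM) code.
void compute_sqrt(const double* x, double* y, int n) {
#pragma omp simd
  for (int i = 0; i < n; ++i) {
    y[i] = std::sqrt(x[i]);
  }
}
```

The generated assembly should then use a packed vsqrtpd on ZMM registers in the main loop instead of the scalar vsqrtsd. Adding -fopt-info-vec makes GCC report which loops it vectorized, which is a quick way to check without reading the assembly.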