I have a trivial loop for which I expect to see YMM registers in the assembly, but I am only seeing XMM.
program loopunroll
  integer i
  double precision x(8)
  do i = 1, 8
    x(i) = dble(i) + 5.0d0
  enddo
end program loopunroll
Then I compile it (gcc or gfortran makes no difference; I am using gcc 8.1.0):
[user@machine avx]$ gfortran -S -mavx loopunroll.f90
[user@machine avx]$ cat loopunroll.f90|grep mm
[user@machine avx]$ cat loopunroll.s|grep mm
vcvtsi2sd -4(%rbp), %xmm0, %xmm0
vmovsd .LC0(%rip), %xmm1
vaddsd %xmm1, %xmm0, %xmm0
vmovsd %xmm0, -80(%rbp,%rax,8)
But if I do this with Intel Parallel Studio 2018 Update 3:
[user@machine avx]$ ifort -S -mavx loopunroll.f90
[user@machine avx]$ cat loopunroll.s|grep mm
vmovdqu .L_2il0floatpacket.0(%rip), %xmm2 #11.8
vpaddd .L_2il0floatpacket.2(%rip), %xmm2, %xmm3 #11.15
vmovupd .L_2il0floatpacket.1(%rip), %ymm4 #11.23
vcvtdq2pd %xmm2, %ymm0 #11.15
vcvtdq2pd %xmm3, %ymm5 #11.15
vaddpd %ymm0, %ymm4, %ymm1 #11.8
vaddpd %ymm5, %ymm4, %ymm6 #11.8
vmovupd %ymm1, loopunroll_$X.0.1(%rip) #11.8
vmovupd %ymm6, 32+loopunroll_$X.0.1(%rip) #11.8
I have also tried the flags -march=core-avx2 -mtune=core-avx2 for both GNU and Intel, and I still get the same result: XMM in the GNU-produced assembly, but YMM in the Intel-produced assembly.
What should I be doing differently please folks?
Many thanks, M
You forgot to enable optimization with gfortran. Use gfortran -O3 -march=native.
For that to not optimize away entirely, write a function (subroutine) that produces a result that code outside that subroutine can see, e.g. take x as an argument and store into it. The compiler will have to emit asm that works for any caller, including one that cares about the contents of the array after calling the subroutine on it.
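For example, a minimal Fortran sketch (the name fill_i5 and the explicit-shape dummy argument are my own choices, not taken from your code):

subroutine fill_i5(x, n)
  ! x is a dummy argument, so the stores are visible to the caller
  ! and cannot be optimized away
  implicit none
  integer, intent(in) :: n
  double precision, intent(out) :: x(n)
  integer :: i
  do i = 1, n
    x(i) = dble(i) + 5.0d0
  enddo
end subroutine fill_i5

With optimization enabled (see below), the compiler has to emit a loop that works for any n and any caller, so the stores stay in the asm.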
For gcc, -ftree-vectorize is only enabled at -O3, not -O2.
The gcc default is -O0, i.e. compile fast and make terribly slow code that gives consistent debugging.
gcc will never auto-vectorize at -O0. You must use -O3 or -O2 -ftree-vectorize.
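For example, something like either of these (untested command lines; assuming the subroutine sketch above is saved as fill_i5.f90, and adjusting -march for your machine) enables the auto-vectorizer:
[user@machine avx]$ gfortran -O3 -march=native -S fill_i5.f90
[user@machine avx]$ gfortran -O2 -ftree-vectorize -march=native -S fill_i5.f90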
The ifort default apparently includes optimization, unlike gcc. You should not expect ifort -S and gcc -S output to be remotely similar if you don't use -O3 for gcc.
"when I use -O3 it throws away any reference to both XMM and YMM in the assembly."
It's a good thing when compilers optimize away useless work.
Write a function that takes an array input arg and writes an output arg, and look at asm for that function. Or a function that operates on two global arrays. Not a whole program, because compilers have whole-program optimization.
Anyway, see How to remove "noise" from GCC/clang assembly output? for tips on writing useful functions for looking at compiler asm output. That's a C Q&A but all the advice applies to Fortran as well: write functions that take args and return a result or have a side effect that can't optimize away.
http://godbolt.org/ doesn't have Fortran, and it looks like -xfortran doesn't work to make g++ compile as Fortran. (-xc works to compile as C instead of C++ on Godbolt, though.) Otherwise I'd recommend that tool for looking at compiler output.
I made a C version of your loop to see what gcc does for presumably similar input to its optimizer. (I don't have gfortran 8.1 installed, and I barely know Fortran. I'm here for the AVX and optimization tags, but gfortran uses the same backend as gcc which I am very familiar with.)
void store_i5(double *x) {
    for (int i = 0; i < 512; i++) {
        x[i] = 5.0 + i;
    }
}
With i<8 as the loop condition, gcc -O3 -march=haswell and clang sensibly optimize the function to just copy 8 doubles from static constants, with vmovupd. Increasing the array size, gcc fully unrolls a copy for surprisingly large sizes, up to 143 doubles. But for 144 or more, it makes a loop that actually calculates. There's probably a tuning parameter somewhere to control this heuristic. BTW, clang fully unrolls a copy even for 256 doubles, with -O3 -march=haswell. But 512 is large enough that both gcc and clang make loops that calculate.
gcc8.1's inner loop (with -O3 -march=haswell) looks like this, using -masm=intel. (See source + asm on the Godbolt compiler explorer.)
vmovdqa ymm1, YMMWORD PTR .LC0[rip] # [0,1,2,3,4,5,6,7]
vmovdqa ymm3, YMMWORD PTR .LC1[rip] # set1_epi32(8)
lea rax, [rdi+4096] # rax = endp
vmovapd ymm2, YMMWORD PTR .LC2[rip] # set1_pd(5.0)
.L2: # do {
vcvtdq2pd ymm0, xmm1 # packed convert 4 elements to double
vaddpd ymm0, ymm0, ymm2 # +5.0
add rdi, 64
vmovupd YMMWORD PTR [rdi-64], ymm0 # store x[i+0..3]
vextracti128 xmm0, ymm1, 0x1
vpaddd ymm1, ymm1, ymm3 # [i0, i1, i2, ..., i7] += 8 packed 32-bit integer add (d=dword)
vcvtdq2pd ymm0, xmm0 # convert the high 4 elements
vaddpd ymm0, ymm0, ymm2
vmovupd YMMWORD PTR [rdi-32], ymm0
cmp rax, rdi
jne .L2 # }while(p < endp);
We can defeat constant propagation for a small array by using an offset, so the values to be stored are not a compile-time constant anymore:
void store_i5_var(double *x, int offset) {
    for (int i = 0; i < 8; i++) {
        x[i] = 5.0 + (i + offset);
    }
}
gcc uses basically the same loop body as above, with a bit of setup but the same vector constants.
gcc -O3 -march=native on some targets will prefer auto-vectorizing with 128-bit vectors, so you still won't get YMM registers. You can use -march=native -mprefer-vector-width=256 to override that (https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). (Or with gcc7 and earlier, -mno-prefer-avx128.)
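For example (untested, reusing the hypothetical fill_i5.f90 from above):
[user@machine avx]$ gfortran -O3 -march=native -mprefer-vector-width=256 -S fill_i5.f90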
gcc prefers 256-bit for -march=haswell because the execution units are fully 256-bit, and it has efficient 256-bit loads/stores.
Bulldozer and Zen split 256-bit instructions into two 128-bit operations internally, so it can actually be faster to run twice as many XMM instructions, especially if your data isn't always aligned by 32, or when scalar prologue/epilogue overhead is relevant. Definitely benchmark both ways if you're using an AMD CPU; actually, it's not a bad idea for any CPU.
Also in this case, gcc doesn't realize that it should use XMM vectors of integers and YMM vectors of doubles. (Clang and ICC are better at mixing different vector widths when appropriate.) Instead, it extracts the high 128 bits of a YMM vector of integers every time. So one reason that 128-bit vectorization sometimes wins is that gcc sometimes shoots itself in the foot when doing 256-bit vectorization. (gcc's auto-vectorization is often clumsy with types that aren't all the same width.)
With -march=znver1 -mno-prefer-avx128, gcc8.1 does the stores to memory with two 128-bit halves, because it doesn't know whether the destination is 32-byte aligned (https://godbolt.org/g/A66Egm). -mtune=znver1 sets -mavx256-split-unaligned-store. You can override that with -mno-avx256-split-unaligned-store, e.g. if your arrays usually are aligned but you haven't given the compiler enough information to know that.
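For example (untested, again with the hypothetical fill_i5.f90), if you know your arrays are 32-byte aligned on a Zen machine:
[user@machine avx]$ gfortran -O3 -march=znver1 -mno-prefer-avx128 -mno-avx256-split-unaligned-store -S fill_i5.f90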