Why does gcc -O3 handle avx256 compare intrinsic differently than gcc -O0 and clang?

I want to set two integer vectors and compare them with SIMD, and later on use this mask for a blend operation on packed floats. I produced the following code:

#include <immintrin.h>
#include <stdio.h>
#include <string.h>


int main(){
    __m256i is =  _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);
    __m256i js =  _mm256_set1_epi32(1);               // integer bit-patterns
    __m256 mask = _mm256_cmp_ps(is,js, _CMP_EQ_OQ);   // compare as subnormal floats

    float val[8];
    memcpy(val, &mask, sizeof(val));
    printf("%f %f %f %f %f %f %f %f \n", val[0], val[1], val[2], val[3], val[4], val[5], val[6], val[7]);
}

which works fine with gcc -mavx main.c as well as clang -mavx main.c and clang -O3 -mavx main.c.

(Editor's note: it'll break with -ffast-math when cmpps treats those denormal inputs as 0.0 so all the compares are true. You want AVX2 _mm256_cmp_epi32 to do an integer compare, and _mm256_castsi256_ps the result. But that's unrelated to the question about gcc -O0 and clang allowing implicit conversion from __m256i to __m256)

However, when I use gcc -O3 -mavx main.c I get the following error message:

main.c: In function ‘main’:
main.c:9:33: error: incompatible type for argument 1 of ‘_mm256_cmp_ps’
    9 |     __m256 mask = _mm256_cmp_ps(is,js, _CMP_EQ_OQ);
      |                                 ^~
      |                                 |
      |                                 __m256i {aka __vector(4) long long int}
In file included from /usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/immintrin.h:51,
                 from main.c:1:
/usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/avxintrin.h:404:23: note: expected ‘__m256’ {aka ‘__vector(8) float’} but argument is of type ‘__m256i’ {aka ‘__vector(4) long long int’}
  404 | _mm256_cmp_ps (__m256 __X, __m256 __Y, const int __P)
      |                ~~~~~~~^~~
main.c:9:36: error: incompatible type for argument 2 of ‘_mm256_cmp_ps’
    9 |     __m256 mask = _mm256_cmp_ps(is,js, _CMP_EQ_OQ);
      |                                    ^~
      |                                    |
      |                                    __m256i {aka __vector(4) long long int}
In file included from /usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/immintrin.h:51,
                 from main.c:1:
/usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/avxintrin.h:404:35: note: expected ‘__m256’ {aka ‘__vector(8) float’} but argument is of type ‘__m256i’ {aka ‘__vector(4) long long int’}
  404 | _mm256_cmp_ps (__m256 __X, __m256 __Y, const int __P)
      |                            ~~~~~~~^~~

I notice two things. First of all, the compiler seems to treat is as __m256i {aka __vector(4) long long int} whereas it contains 8 ints. Secondly, the compiler is correct to complain, because the intel intrinsics guide 1 shows the arguments as __m256. I'm now confused why this code even worked at the beginning. And if it is indeed correct because the integers are casted to floats, then I don't understand why it doesn't work with gcc -O3.

I did not want to use _mm256_cmpeq_epi32 which returns an __m256i and there (seems to be no) is no blend_ps instruction that accepts such a mask.

Why do the compilers behave differently, and what is the correct way to do this operation?

Compiler versions

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-pkgversion='Arch Linux 9.3.0-1' --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++,d --enable-shared --enable-threads=posix --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --enable-multilib --disable-werror --enable-checking=release --enable-default-pie --enable-default-ssp --enable-cet=auto gdc_include_dir=/usr/include/dlang/gdc
Thread model: posix
gcc version 9.3.0 (Arch Linux 9.3.0-1)

$ clang -v
clang version 10.0.0 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-pc-linux-gnu/8.4.0
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9.3.0
Found candidate GCC installation: /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/8.4.0
Found candidate GCC installation: /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/9.3.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/8.4.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0
Found candidate GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/8.4.0
Found candidate GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/9.3.0
Selected GCC installation: /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/9.3.0
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
Found CUDA installation: /opt/cuda, version 10.1

[1] https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Solution

First of all, the compiler seems to treat is as __m256i {aka __vector(4) long long int} whereas it contains 8 ints.

The __m128i and larger similar vectors don't specify the actual size (and number) of integers stored in them. You can use the same __m128i type to store 16 uint8_ts or 2 uint64_ts or anything in between. The important part is that it is used to store integers. It is operations on __m128i and larger similar vectors what specifies the interpretation of the verctors as a pack of integers of a given width. For example, both _mm_add_epi16 and _mm_add_epi32 accept __m128i arguments, but the first one interprets it as a vector of 8 uint16_ts, and the second - 4 uint32_ts.

Secondly, the compiler is correct to complain, because the intel intrinsics guide 1 shows the arguments as __m256.

I think, the compiler is correct to complain. That it compiles the code with -O0 seems to be a compiler bug. In gcc, __m128i and other vectors are implemented using __attribute__((vector_size)) attributes, and the documentation says one should use __builtin_convertvector intrinsic to convert between vectors of different types.

The original definition of the __m128i and other vector types in Intel Software Developer's Manual, Section 3.1.1.10, doesn't say anything explicitly about convertibility of vectors of different types, though it does say this:

These SIMD data types are not basic Standard C data types or C++ objects, so they may be used only with the assignment operator, passed as function arguments, and returned from a function call.

Given this, I gather that these vector types are not supposed to be implicitly convertible. You certainly cannot rely on that the conversion, if it does in fact compile, will have any particular behavior. That is especially given that integer vectors don't specify the size of their elements. Therefore, you should always use an intrinsic to define the type of conversion you want, e.g. _mm_cvtepi32_ps/_mm_cvtepi32_pd or _mm_castsi128_ps/_mm_castsi128_pd.

I did not want to use _mm256_cmpeq_epi32 which returns an __m256i and there (seems to be no) is no blend_ps instruction that accepts such a mask.

_mm256_cmpeq_epi32 is AVX2, and there is _mm256_blendv_epi8 in AVX2. If you're only limited to AVX then you have to operate on 128-bit integer vectors.

Using _mm256_cmp_ps to operate on integer vectors is incorrect because its behavior is different from integer comparison. In particular, there are special rules if at least one of the input operands matches a NaN bit pattern (e.g. with _CMP_EQ_OQ operand your comparison will always return 0 in the resulting vector element).