gcc c++ coroutine runs avx SIMD code, but causes SIGSEGV

c++ coroutine runs avx SIMD code, but causes SIGSEGV for AVX2 and AVX512

#define AVX512 0
#define AVX2 1
#define SSE 0

HelloCoroutine hello(int& index, int id, int group_size) {
    unsigned res=0;
#if AVX512
// g++ simd.cpp -std=gnu++20 -fcoroutines -O2 -march=skylake -mavx512f
// segment fault
    for(auto i= index++; i< 20; i=index++)
    {
        std::cout <<"step 1" <<std::endl;
        __m512i v_offset = _mm512_set1_epi64(int64_t (i));
        std::cout <<"step 2" <<std::endl;
        __m512i v_size = _mm512_set1_epi64(int64_t(group_size));
        std::cout <<"step 3" <<std::endl;
        res = _mm512_cmpgt_epi64_mask(v_offset, v_size);
        cout <<i << " > " << group_size <<" ? " << (int)res<<endl;
        co_await std::suspend_always();
    }
#elif AVX2 
// g++ simd.cpp -std=gnu++20 -fcoroutines -O2 -march=skylake
// only specify `-O2 -march=skylake` and runs ok on local machine, otherwise segment fault (also on godbolt)
    for(auto i= index++; i< 20; i=index++)
    {
        std::cout <<"step 1" <<std::endl;
        __m256i v_offset = _mm256_set1_epi32(int32_t (i));
        std::cout <<"step 2" <<std::endl;
        __m256i v_size = _mm256_set1_epi32(int32_t(group_size));
        std::cout <<"step 3" <<std::endl;
        res = _mm256_movemask_epi8(_mm256_cmpgt_epi32(v_offset, v_size));
        cout <<i << " > " << group_size <<" ? " << (int)res<<endl;
        co_await std::suspend_always();
    }
#elif SSE
    for(auto i= index++; i< 20; i=index++)
    {
        __m128i v_offset = _mm_set1_epi32(int32_t (i));
        __m128i v_size = _mm_set1_epi32(int32_t(group_size));
        res = _mm_movemask_epi8(_mm_cmpgt_epi32(v_offset, v_size));
        cout <<i << " > " << group_size <<" ? " << res<<endl;
        co_await std::suspend_always();
    }    
#else
    for(auto i= index++; i< 20; i=index++)
    {
        res = i > group_size;
        cout <<i << " > " << group_size <<" ? " << res<<endl;
        co_await std::suspend_always();
    }
#endif
}

compile at https://godbolt.org/z/h3hej1ddq

-std=c++20 -fcoroutines -mbmi2 -mavx -mavx512f -mavx512pf -mavx512er -mavx512cd -mavx512vl

but result error for avx and avx512, only SSE works OK

Program returned: 139 Program terminated with signal: SIGSEGV step 1

but it works on on clang-16 -std=gnu++20 -O2 -march=skylake -mavx512f https://godbolt.org/z/nMfbn8G9T

Solution

This seems to be a GCC bug, unless coroutines are documented to not support local variables with alignof(T) > alignof(max_align_t) (Such as __m256i or __m512i).

You can report it (preferably with a minimal AVX2 test case) to https://gcc.gnu.org/bugzilla/

With a version that only requires AVX2 instead of AVX-512, I could test it on my desktop and see it faults on vmovdqa YMMWORD PTR [rbx+0x40],ymm0 which requires 32-byte alignment. (Storing the result of a vpbroadcastd, initializing __m256i v_offset = set1....) (https://godbolt.org/z/8vfz3v5v1 just fixes the __m256i block, compiles with -std=gnu++20 -fcoroutines -O2 -march=skylake)

IDK why it's using RBX to access locals instead of RSP; I guess that's how coroutines work in the hello(hello(int&, int, int)::_Z5helloRiii.Frame*) [clone .actor]: version of the function. In that coroutine version, GCC still just aligns the stack pointer with and rsp, -32 / sub rsp, 192, but that doesn't help for things stored relative to RBX.

Note that all 3 of your versions require AVX-512, just with different vector widths. Compare-into-mask like _mm_cmpgt_epi32_mask always requires AVX-512.

If you want an integer mask with AVX2 or SSE, you need _mm_cmpgt_epi32 and _mm_movemask_epi8 (1 bit per byte) or _mm_movemask_ps( _mm_castsi128_ps(cmp_result) ) (1 bit per int32), or the _mm256 equivalent.

Use -march=native or -march=skylake-avx512, -march=znver4, or whatever. No real CPUs have ever supported both AVX512ER (Xeon Phi) and AVX512VL (everything else). https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512

If your CPU didn't support AVX-512, you'd get SIGILL (on all 3), not SIGSEGV.