c++ coroutine runs avx SIMD code, but causes SIGSEGV for AVX2 and AVX512
#define AVX512 0
#define AVX2 1
#define SSE 0
HelloCoroutine hello(int& index, int id, int group_size) {
unsigned res=0;
#if AVX512
// g++ simd.cpp -std=gnu++20 -fcoroutines -O2 -march=skylake -mavx512f
// segment fault
for(auto i= index++; i< 20; i=index++)
{
std::cout <<"step 1" <<std::endl;
__m512i v_offset = _mm512_set1_epi64(int64_t (i));
std::cout <<"step 2" <<std::endl;
__m512i v_size = _mm512_set1_epi64(int64_t(group_size));
std::cout <<"step 3" <<std::endl;
res = _mm512_cmpgt_epi64_mask(v_offset, v_size);
cout <<i << " > " << group_size <<" ? " << (int)res<<endl;
co_await std::suspend_always();
}
#elif AVX2
// g++ simd.cpp -std=gnu++20 -fcoroutines -O2 -march=skylake
// only specify `-O2 -march=skylake` and runs ok on local machine, otherwise segment fault (also on godbolt)
for(auto i= index++; i< 20; i=index++)
{
std::cout <<"step 1" <<std::endl;
__m256i v_offset = _mm256_set1_epi32(int32_t (i));
std::cout <<"step 2" <<std::endl;
__m256i v_size = _mm256_set1_epi32(int32_t(group_size));
std::cout <<"step 3" <<std::endl;
res = _mm256_movemask_epi8(_mm256_cmpgt_epi32(v_offset, v_size));
cout <<i << " > " << group_size <<" ? " << (int)res<<endl;
co_await std::suspend_always();
}
#elif SSE
for(auto i= index++; i< 20; i=index++)
{
__m128i v_offset = _mm_set1_epi32(int32_t (i));
__m128i v_size = _mm_set1_epi32(int32_t(group_size));
res = _mm_movemask_epi8(_mm_cmpgt_epi32(v_offset, v_size));
cout <<i << " > " << group_size <<" ? " << res<<endl;
co_await std::suspend_always();
}
#else
for(auto i= index++; i< 20; i=index++)
{
res = i > group_size;
cout <<i << " > " << group_size <<" ? " << res<<endl;
co_await std::suspend_always();
}
#endif
}
compile at https://godbolt.org/z/h3hej1ddq
-std=c++20 -fcoroutines -mbmi2 -mavx -mavx512f -mavx512pf -mavx512er -mavx512cd -mavx512vl
but result error for avx and avx512, only SSE works OK
Program returned: 139 Program terminated with signal: SIGSEGV step 1
but it works on on clang-16 -std=gnu++20 -O2 -march=skylake -mavx512f
https://godbolt.org/z/nMfbn8G9T
This seems to be a GCC bug, unless coroutines are documented to not support local variables with alignof(T) > alignof(max_align_t)
(Such as __m256i
or __m512i
).
You can report it (preferably with a minimal AVX2 test case) to https://gcc.gnu.org/bugzilla/
With a version that only requires AVX2 instead of AVX-512, I could test it on my desktop and see it faults on vmovdqa YMMWORD PTR [rbx+0x40],ymm0
which requires 32-byte alignment. (Storing the result of a vpbroadcastd
, initializing __m256i v_offset = set1...
.) (https://godbolt.org/z/8vfz3v5v1 just fixes the __m256i
block, compiles with -std=gnu++20 -fcoroutines -O2 -march=skylake
)
IDK why it's using RBX to access locals instead of RSP; I guess that's how coroutines work in the hello(hello(int&, int, int)::_Z5helloRiii.Frame*) [clone .actor]:
version of the function. In that coroutine version, GCC still just aligns the stack pointer with and rsp, -32
/ sub rsp, 192
, but that doesn't help for things stored relative to RBX.
Note that all 3 of your versions require AVX-512, just with different vector widths. Compare-into-mask like _mm_cmpgt_epi32_mask
always requires AVX-512.
If you want an integer mask with AVX2 or SSE, you need _mm_cmpgt_epi32
and _mm_movemask_epi8
(1 bit per byte) or _mm_movemask_ps( _mm_castsi128_ps(cmp_result) )
(1 bit per int32), or the _mm256
equivalent.
Use -march=native
or -march=skylake-avx512
, -march=znver4
, or whatever. No real CPUs have ever supported both AVX512ER (Xeon Phi) and AVX512VL (everything else). https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512
If your CPU didn't support AVX-512, you'd get SIGILL (on all 3), not SIGSEGV.