Tags: c++, gcc, sse, intrinsics, strict-aliasing

GCC misoptimises SSE function


I'm converting a project to compile with gcc instead of clang and I've run into an issue with a function that uses SSE intrinsics:

#include <emmintrin.h>  // SSE2 intrinsics

void dodgy_function(
    const short* lows,
    const short* highs,
    short* mins,
    short* maxs,
    int its
)
{
    __m128i v00[2] = { _mm_setzero_si128(), _mm_setzero_si128() }; 
    __m128i v10[2] = { _mm_setzero_si128(), _mm_setzero_si128() };

    for (int i = 0; i < its; ++i) {
        reinterpret_cast<short*>(v00)[i] = lows[i];
        reinterpret_cast<short*>(v10)[i] = highs[i];
    }

    reinterpret_cast<short*>(v00)[its] = reinterpret_cast<short*>(v00)[its - 1];
    reinterpret_cast<short*>(v10)[its] = reinterpret_cast<short*>(v10)[its - 1];

    __m128i v01[2] = {_mm_setzero_si128(), _mm_setzero_si128()};
    __m128i v11[2] = {_mm_setzero_si128(), _mm_setzero_si128()};

    __m128i min[2];
    __m128i max[2];

    min[0] = _mm_min_epi16(_mm_min_epi16(v11[0], v01[0]), _mm_min_epi16(v10[0], v00[0]));
    max[0] = _mm_max_epi16(_mm_max_epi16(v11[0], v01[0]), _mm_max_epi16(v10[0], v00[0]));

    min[1] = _mm_min_epi16(_mm_min_epi16(v11[1], v01[1]), _mm_min_epi16(v10[1], v00[1])); 
    max[1] = _mm_max_epi16(_mm_max_epi16(v11[1], v01[1]), _mm_max_epi16(v10[1], v00[1]));

    reinterpret_cast<__m128i*>(mins)[0] = _mm_min_epi16(reinterpret_cast<__m128i*>(mins)[0], min[0]);
    reinterpret_cast<__m128i*>(maxs)[0] = _mm_max_epi16(reinterpret_cast<__m128i*>(maxs)[0], max[0]);

    reinterpret_cast<__m128i*>(mins)[1] = _mm_min_epi16(reinterpret_cast<__m128i*>(mins)[1], min[1]);
    reinterpret_cast<__m128i*>(maxs)[1] = _mm_max_epi16(reinterpret_cast<__m128i*>(maxs)[1], max[1]);
}

Now with clang it gives me the expected output, but with gcc it prints all zeros: godbolt link

Playing around, I discovered that gcc gives me the right results when I compile with -O1 but goes wrong with -O2 and -O3, suggesting the optimiser is going awry. Is there something particularly wrong I'm doing that would cause this behaviour?

As a workaround, I can wrap things up in a union and gcc will then give me the right result, but that feels a little icky: godbolt link 2

Any ideas?


Solution

  • The problem is that you're using short* to access the elements of a __m128i object. That violates the strict-aliasing rule. It's only safe to go the other way: use a __m128i* dereference or, more normally, _mm_load_si128( (const __m128i*)ptr ).

    __m128i* is exactly like char* - you can point it at anything, but not vice versa: Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?


    The only standard-blessed way to do type punning is with memcpy:

        // Copy the first `its` shorts into the vectors...
        memcpy(v00, lows, its * sizeof(short));
        memcpy(v10, highs, its * sizeof(short));
        // ...then duplicate the last element into slot `its`.
        // (Computing the destination address via short* is fine; it's only
        // *accessing* through the wrong-typed lvalue that's UB.)
        memcpy(reinterpret_cast<short*>(v00) + its, lows + its - 1, sizeof(short));
        memcpy(reinterpret_cast<short*>(v10) + its, highs + its - 1, sizeof(short));
    

    https://godbolt.org/z/f63q7x

    I prefer just using aligned memory of the correct type directly:

        alignas(16) short v00[16];
        alignas(16) short v10[16];
        auto mv00 = reinterpret_cast<__m128i*>(v00);
        auto mv10 = reinterpret_cast<__m128i*>(v10);
        _mm_store_si128(mv00, _mm_setzero_si128());
        _mm_store_si128(mv10, _mm_setzero_si128());
        _mm_store_si128(mv00 + 1, _mm_setzero_si128());
        _mm_store_si128(mv10 + 1, _mm_setzero_si128());
    
        for (int i = 0; i < its; ++i) {
            v00[i] = lows[i];
            v10[i] = highs[i];
        }
    
        v00[its] = v00[its - 1];
        v10[its] = v10[its - 1];
    

    https://godbolt.org/z/bfanne

    I'm not positive that this setup is actually standard-blessed (it definitely is for _mm_load_ps, since you can do it without type punning at all), but it does seem to fix the issue too. I'd guess that any reasonable implementation of the load/store intrinsics has to provide the same sort of aliasing guarantees that memcpy does, since they're more or less the kosher way to go from straight-line to vectorized code on x86.

    As you mentioned in your question, you can also force the alignment with a union, and I've used that too in pre-C++11 contexts. Even then, though, I still personally write the loads and stores explicitly (even when they're just going to/from aligned memory), because issues like this tend to pop up if you don't.