How to take the high part of __m256

I have a __m256 or __m256i, and I want to take the high part.

Given a __m256 variable, I know I can do that with _mm256_extractf128_ps(variable, 1).

But for the low part, instead of _mm256_extractf128_ps(tr3, 0), it is better to use *((__m128*)&variable).

How can I take the high part with a pointer, the same way I did for the low part?

Can I add a number to the pointer or increment it? For example: *((__m128*)&variable+128)

Solution

_mm256_extractf128_ps(v, 1) is the best way. If your compiler doesn't compile that efficiently, use a better compiler (e.g. clang has a very good shuffle optimizer).

For the low half, all compilers optimize _mm256_extractf128_ps(v, 0) so that no vextractf128 instruction is actually emitted, but the most explicit way with intrinsics to say you just want the low 128 bits is _mm256_castps256_ps128, with similar casts for __m256i (_mm256_castsi256_si128) and __m256d (_mm256_castpd256_pd128).

These normally compile to just using the XMM low half of whatever YMM register the compiler had the vector variable in, although some compilers have missed-optimization bugs and sometimes emit a useless vmovaps xmm, xmm instruction instead of letting later instructions read either the low xmm or the full ymm of that register.
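As a minimal sketch of the intrinsic-only approach described above: split_halves and the AVX_FN macro are hypothetical names introduced here for illustration. The target attribute is a GNU C (GCC/clang) feature, used only so the function can be built without -mavx, and the code assumes it runs on a CPU with AVX.

```c
#include <assert.h>
#include <immintrin.h>

/* Assumption: GNU C. The attribute enables AVX for this one function;
   with -mavx on the command line it is redundant. */
#if defined(__GNUC__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Hypothetical helper: loads 8 floats from `in` into a __m256, then writes
   lanes 0..3 to `lo` and lanes 4..7 to `hi`. */
AVX_FN void split_halves(const float *in, float *lo, float *hi)
{
    assert(in && lo && hi);

    __m256 v = _mm256_loadu_ps(in);

    /* Low half: a cast, which normally costs no instruction at all. */
    __m128 low = _mm256_castps256_ps128(v);

    /* High half: the extract intrinsic with index 1. */
    __m128 high = _mm256_extractf128_ps(v, 1);

    _mm_storeu_ps(lo, low);
    _mm_storeu_ps(hi, high);
}
```

Calling split_halves on an array holding 0..7 should leave 0..3 in lo and 4..7 in hi. The unaligned load and store intrinsics are used so the caller's arrays need no special alignment.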

Using pointer math kind of encourages the compiler to store and reload, which you usually don't want. But in practice most compilers will, most of the time, optimize it back to ALU shuffles, even if you actually wanted a store/reload to avoid a shuffle-port bottleneck.

I don't recommend pointer casting. However, *((__m128*)&variable) and ((__m128*)&variable)[1] are legal because intrinsic vector types such as __m128 are like char - they're allowed to alias any other type without violating strict aliasing and causing Undefined Behaviour.

C pointer math moves the pointer in units of the size of the pointed-to type, e.g. +1 on a __m128* moves by 16 bytes, which is one __m128. This is why ++ always works to iterate a pointer over an array. (See "Pointer Arithmetic".)

Since you want the 2nd __m128, you should add 1 to your __m128*, e.g. *(1 + (__m128*)&variable). Your +128 would instead step 128 * 16 = 2048 bytes past the start, far outside the 32-byte object. C [] syntax is defined in terms of pointer addition + dereference, so we can also write it as ((__m128*)&variable)[1], applying [] to the cast result. Both these ways of writing it make it 100% clear that the +1 applies to the __m128* after the cast, not to the __m256* from &var before the cast. IIRC, a cast has higher precedence than binary +, so *((__m128*)&var + 1) would also be safe. But writing it the other way means you don't have to remember that when reading the code later.
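The pointer-arithmetic rule can be checked without any SIMD at all. This is a sketch using plain structs as stand-ins: vec128 and vec256 are hypothetical types that only mimic the sizes of __m128 and __m256, assuming a 4-byte float and no struct padding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: 16 and 32 bytes on typical platforms. */
typedef struct { float lane[4]; } vec128;
typedef struct { vec128 half[2]; } vec256;

/* Byte distance from the start of *v to the element reached by adding 1
   to a vec128* that points at the start of *v. */
size_t second_half_offset(const vec256 *v)
{
    assert(v != NULL);
    const vec128 *p = (const vec128 *)v;   /* cast first ... */
    const vec128 *second = p + 1;          /* ... then step by one vec128 */
    return (size_t)((const char *)second - (const char *)v);
}

/* p[1] is defined as *(p + 1), so both spellings name the same address. */
int index_equals_add(const vec256 *v)
{
    assert(v != NULL);
    const vec128 *p = (const vec128 *)v;
    return &p[1] == 1 + p;
}

/* How many bytes adding n to a vec128* would skip. Computed as a number
   only, because forming such a pointer out of bounds would be UB. */
size_t bytes_skipped(size_t n)
{
    return n * sizeof(vec128);
}
```

Under those size assumptions, second_half_offset returns 16, while bytes_skipped(128) is 2048, which is why +128 does not land on the high half of a 32-byte vector.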

In GNU C, intrinsic types are defined with __attribute__((may_alias)). In MSVC, aliasing is always allowed. This is what makes the pointer-casting safe for this type punning. (See "Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?")

Casting to any other pointer type, like ((float*)&vec)[0], would violate strict aliasing and be UB.
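For completeness, the pointer-cast form discussed above can be sketched as follows. It is shown only to illustrate the syntax, not as a recommendation. high_half_via_pointer and AVX_FN are hypothetical names, the target attribute assumes GNU C, and the cast relies on the may_alias behaviour of __m128 described above.

```c
#include <assert.h>
#include <immintrin.h>

/* Assumption: GNU C. Enables AVX for this function without -mavx. */
#if defined(__GNUC__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Hypothetical helper: loads 8 floats from `in` and writes lanes 4..7 to
   `hi`, reading the high half through a __m128* instead of an intrinsic. */
AVX_FN void high_half_via_pointer(const float *in, float *hi)
{
    assert(in && hi);

    __m256 v = _mm256_loadu_ps(in);

    /* Index 1 of the __m128 view is bytes 16..31 of v, i.e. the high half.
       Equivalent spelling: *(1 + (__m128 *)&v). */
    __m128 high = ((__m128 *)&v)[1];

    _mm_storeu_ps(hi, high);
}
```

For an input of 0..7 this should produce 4..7 in hi, the same result as _mm256_extractf128_ps(v, 1). The compiler is free to turn the cast back into a shuffle or to store and reload.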

As I said, since you normally want the compiler to use shuffle instructions, messing around with pointers requires the compiler to optimize away all the pointers. Use intrinsics.