I've got an array of floats that I'd like to access in reverse order. In my non-vectorized code this is easy.
Here is a simplifed version of the data that I have.
float A[8] = {a, b, c, d, e, f, g, h};
float B[8] = {s, t, u, v, w, x, y, z};
Here is the operation I would like to do.
float C[8] = {a*z, b*y, c*x, d*w, e*v, f*u, g*t, h*s};
I'd like to be able to do some kind of load_ps
operation that will give me something like this:
__m256 A_Loaded = _mm256_load_ps(&A[0]);
= {a, b, c, d, e, f, g, h};
__m256 B_LoadedReversed = _mm256_loadr_ps(&B[7]);
= {z, y, x, w, v, u, t, s};
__m256 Output = _mm256_mul_ps(A_Loaded, B_LoadedReversed);
= {a*z, b*y, c*x, d*w, e*v, f*u, g*t, h*s};
One of the data sources I have is a lookup table, so could be reversed if push comes to shove, but would much prefer to avoid that as that would compilcate other areas of the program.
I've currently got a botch workaround using _mm256_set_ps()
and manually pointing to the data I need, but that is not as performative as I would like.
I know there is a 'reversed' _mm256_set_ps()
(_mm256_setr_ps()
), but there doesn't seem to be the _mm256_loadr_ps()
that I need.
Any ideas and thoughts about this problem would be greatly appreciated! Thanks in advance.
You can reverse the order inside a __m256
in two steps, using _mm256_permute_ps
and _mm_256_permute2f128_ps
.
_mm256_permute_ps
allows you to permute within each "lane", the high and low 128-bit chunks.
_mm_256_permute2f128_ps
allows you to permute 128-bit chunks across lanes.
It's something like this:
__m256 b = _mm256_loadr_ps(&B[0]);
b = _mm256_permute_ps(b, _MM_SHUFFLE(3, 2, 1, 0));
b = _mm256_permute2f128_ps(b, b, 1);
These instructions are documented in the Intel intrinsics guide: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
How does setr_ps() reverse things? It just reverses the arguments. Here's the version I pulled from my GCC installation:
extern __inline __m256 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_setr_ps (float __A, float __B, float __C, float __D,
float __E, float __F, float __G, float __H)
{
return _mm256_set_ps (__H, __G, __F, __E, __D, __C, __B, __A);
}
You can see, setr_ps() does not correspond to any underlying processor capability, it just reorders the arguments.