How to load into __m256 from a float* but reading backwards in memory as opposed to forwards?

I've got an array of floats that I'd like to access in reverse order. In my non-vectorized code this is easy.

Here is a simplifed version of the data that I have.

float A[8] = {a, b, c, d, e, f, g, h};
float B[8] = {s, t, u, v, w, x, y, z};

Here is the operation I would like to do.

float C[8] = {a*z, b*y, c*x, d*w, e*v, f*u, g*t, h*s};

I'd like to be able to do some kind of load_ps operation that will give me something like this:

__m256 A_Loaded         = _mm256_load_ps(&A[0]);
                        = {a, b, c, d, e, f, g, h};

__m256 B_LoadedReversed = _mm256_loadr_ps(&B[7]);
                        = {z, y, x, w, v, u, t, s};

__m256 Output = _mm256_mul_ps(A_Loaded, B_LoadedReversed);
              = {a*z, b*y, c*x, d*w, e*v, f*u, g*t, h*s};

One of the data sources I have is a lookup table, so could be reversed if push comes to shove, but would much prefer to avoid that as that would compilcate other areas of the program.

I've currently got a botch workaround using _mm256_set_ps() and manually pointing to the data I need, but that is not as performative as I would like.

I know there is a 'reversed' _mm256_set_ps() (_mm256_setr_ps()), but there doesn't seem to be the _mm256_loadr_ps() that I need.

Any ideas and thoughts about this problem would be greatly appreciated! Thanks in advance.

Solution

You can reverse the order inside a __m256 in two steps, using _mm256_permute_ps and _mm_256_permute2f128_ps.

_mm256_permute_ps allows you to permute within each "lane", the high and low 128-bit chunks.
_mm_256_permute2f128_ps allows you to permute 128-bit chunks across lanes.

It's something like this:

__m256 b = _mm256_loadr_ps(&B[0]);
b = _mm256_permute_ps(b, _MM_SHUFFLE(3, 2, 1, 0));
b = _mm256_permute2f128_ps(b, b, 1);

These instructions are documented in the Intel intrinsics guide: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html

How does setr_ps work?

How does setr_ps() reverse things? It just reverses the arguments. Here's the version I pulled from my GCC installation:

extern __inline __m256 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_setr_ps (float __A, float __B, float __C, float __D,
                float __E, float __F, float __G, float __H)
{
  return _mm256_set_ps (__H, __G, __F, __E, __D, __C, __B, __A);
}

You can see, setr_ps() does not correspond to any underlying processor capability, it just reorders the arguments.