Tags: c++, intrinsics, avx, avx2

Unexpected _mm256_shuffle_epi8 with __m256i vectors


I had seen this great answer on image conversions using __m128i, and thought I'd try and use AVX2 to see if I could get it any faster. The task is taking an input RGB image and converting it to RGBA (note the other question is BGRA, but that's not really a big difference...).

I can include more code if desired, but this stuff gets quite verbose and I'm stuck on something seemingly very simple. Suppose for this code that everything is 32-byte aligned, compiled with -mavx2, etc.

Given an input uint8_t *source RGB and output uint8_t *destination RGBA, it goes something like this (just trying to fill a quarter of the image in stripes [since this is vector land]).

#include <immintrin.h>
__m256i *src = (__m256i *) source;
__m256i *dest = (__m256i *) destination;

// for this particular image
unsigned width = 640;
unsigned height = 480;
unsigned unroll_N = (width * height) / 32; // 32 pixels per iteration: 3 source / 4 destination YMM registers
for(unsigned idx = 0; idx < unroll_N; ++idx) {
    // Load first portion and fill all of dest[0]
    __m256i src_0 = src[0];
    __m256i tmp_0 = _mm256_shuffle_epi8(src_0,
        _mm256_set_epi8(
            0x80, 23, 22, 21,// A07 B07 G07 R07
            0x80, 20, 19, 18,// A06 B06 G06 R06
            0x80, 17, 16, 15,// A05 B05 G05 R05
            0x80, 14, 13, 12,// A04 B04 G04 R04
            0x80, 11, 10,  9,// A03 B03 G03 R03
            0x80,  8,  7,  6,// A02 B02 G02 R02
            0x80,  5,  4,  3,// A01 B01 G01 R01
            0x80,  2,  1,  0 // A00 B00 G00 R00
        )
    );

    dest[0] = tmp_0;

    // move the input / output pointers forward
    src  += 3;
    dest += 4;
}// end for

This doesn't actually work: stripes show up in each "quarter".

  • My understanding is that 0x80 in the mask should produce 0x00 in the result.
    • It doesn't really matter what value ends up there (it's the alpha channel; in the real code it gets OR'd with 0xff like the linked answer). A small sketch of that is shown after this list.
  • It somehow seems to be related to rows 04 to 07: if I make them all 0x80, leaving just 00-03, the inconsistencies go away.
    • But of course, then I'm not copying everything I need to.
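
For reference, here is a minimal, self-contained sketch of that zero-then-OR behaviour (a toy example of mine, not the real conversion code; assumes a -mavx2 build):

#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
    // Source bytes 0..31.
    alignas(32) uint8_t buf[32];
    for (int i = 0; i < 32; ++i) buf[i] = (uint8_t) i;
    __m256i v = _mm256_load_si256((const __m256i *) buf);

    // Mask bytes per dword (little-endian): 0, 1, 2, 0x80.
    // Any mask byte with its high bit set writes 0x00 to that output byte.
    __m256i mask = _mm256_set1_epi32((int) 0x80020100u);
    __m256i out = _mm256_shuffle_epi8(v, mask);

    // The zeroed byte can then be forced to 0xff, like the alpha channel above.
    out = _mm256_or_si256(out, _mm256_set1_epi32((int) 0xff000000u));

    alignas(32) uint8_t res[32];
    _mm256_store_si256((__m256i *) res, out);
    for (int i = 0; i < 32; ++i) printf("%02x ", res[i]);
    printf("\n");
}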

What am I missing here? Like, is it possible I ran out of registers or something? I'd be very surprised by that...

[Image: output with both parts of the shuffle]

Using this mask:

_mm256_set_epi8(
    // 0x80, 23, 22, 21,// A07 B07 G07 R07
    // 0x80, 20, 19, 18,// A06 B06 G06 R06
    // 0x80, 17, 16, 15,// A05 B05 G05 R05
    // 0x80, 14, 13, 12,// A04 B04 G04 R04
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 11, 10,  9,// A03 B03 G03 R03
    0x80,  8,  7,  6,// A02 B02 G02 R02
    0x80,  5,  4,  3,// A01 B01 G01 R01
    0x80,  2,  1,  0 // A00 B00 G00 R00
)

[Image: output using the above shuffle instead]


Solution

  • _mm256_shuffle_epi8 works like two _mm_shuffle_epi8 operations side by side, instead of like a more useful (but probably higher-latency) full-width shuffle that can put any byte anywhere; a lane-aware sketch of the conversion follows below. Here's a diagram from www.officedaytime.com/simd512e:

    [Diagram: vpshufb]

    AVX512VBMI has new byte-granularity shuffles such as vpermb that can cross lanes, but current processors don't support that instruction set extension yet.
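
To make the per-lane behaviour concrete, here is a minimal lane-aware sketch of the RGB-to-RGBA step (a sketch of mine, not the exact code from the linked answer; the helper name rgb8_to_rgba8 and the unaligned loads/stores are illustrative). Each 128-bit lane is fed 12 useful source bytes via two 128-bit loads, so the shuffle indices never have to cross a lane:

#include <immintrin.h>
#include <cstdint>

// Convert 8 RGB pixels (24 bytes) to 8 RGBA pixels (32 bytes).
static inline void rgb8_to_rgba8(const uint8_t *src, uint8_t *dst) {
    // Two 128-bit loads so that EACH lane holds 12 useful source bytes.
    // Note: the second load reads 4 bytes past the 24 bytes actually consumed,
    // so the last group of an image needs separate handling.
    __m128i lo = _mm_loadu_si128((const __m128i *) src);        // bytes 0..15
    __m128i hi = _mm_loadu_si128((const __m128i *) (src + 12)); // bytes 12..27
    __m256i pixels = _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);

    // Same pattern for both lanes: expand 4 RGB pixels per lane into 4 RGBA
    // pixels; -128 (0x80) zeroes the alpha byte. Indices stay in 0..11, so
    // vpshufb's per-lane limitation no longer matters.
    const __m256i shuf = _mm256_setr_epi8(
        0, 1, 2, -128,  3, 4, 5, -128,  6, 7, 8, -128,  9, 10, 11, -128,
        0, 1, 2, -128,  3, 4, 5, -128,  6, 7, 8, -128,  9, 10, 11, -128);
    __m256i rgba = _mm256_shuffle_epi8(pixels, shuf);

    // Set alpha to 0xff, as in the question.
    rgba = _mm256_or_si256(rgba, _mm256_set1_epi32((int) 0xff000000u));

    _mm256_storeu_si256((__m256i *) dst, rgba);
}

A caller would then step src by 24 and dst by 32 per group of 8 pixels, instead of the 3-register / 4-register strides used in the question.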