assembly, bit-manipulation, sse, sse2

Unpacking a bitfield (Inverse of pmovmskb)


PMOVMSKB does a really nice job of packing byte fields into bits.
However, I want to do the reverse.
I have a 16-bit bitfield that I want to expand into an XMM register,
one byte field per bit.
Preferably, a set bit should set the MSB (0x80) of its byte field, but I can live with a set bit producing 0xFF in the byte field.

I've seen the following option on https://software.intel.com/en-us/forums/intel-isa-extensions/topic/298374:

movd mm0, eax
punpcklbw mm0, mm0
pshufw mm0, mm0, 0x00
pand mm0, [mask8040201008040201h]
pcmpeqb mm0, [mask8040201008040201h]

However, this code only works with MMX registers and cannot be carried over to XMM registers as is, because pshufw has no XMM form.
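
The mask-and-compare step at the end of that sequence can be modeled outside of assembly. The following is a minimal Python sketch, not the forum code itself: `expand_byte` and `MASK` are hypothetical names, and the sketch assumes the input byte has already been broadcast to all eight lanes by the punpcklbw/pshufw pair.

```python
# Per-lane bit selectors: the bytes of 0x8040201008040201, lowest byte first.
MASK = [1 << i for i in range(8)]

def expand_byte(b):
    """Model pand + pcmpeqb on eight lanes that all hold the byte b."""
    lanes = [b] * 8                                # state after punpcklbw + pshufw 0x00
    anded = [x & m for x, m in zip(lanes, MASK)]   # pand: isolate bit i in lane i
    # pcmpeqb: a lane becomes 0xFF only if its selected bit survived the AND
    return [0xFF if a == m else 0x00 for a, m in zip(anded, MASK)]

# Bits 0 and 2 are set, so lanes 0 and 2 come out as 0xff.
print([hex(x) for x in expand_byte(0b00000101)])
```

Comparing against the mask itself, rather than against zero, is what turns a set bit into 0xFF instead of 0x00.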

I know I can use PSHUFB; however, that's SSSE3, and I would like SSE2 code because it needs to work on any AMD64 system.

Is there a way to do this in pure SSE2 code?
No intrinsics please, just plain Intel-syntax x64 code.


Solution

  • Luckily, pshufd is SSE2; you just need to unpack once more. I believe this should work:

    movd xmm0, eax
    punpcklbw xmm0, xmm0
    punpcklbw xmm0, xmm0
    pshufd xmm0, xmm0, 0x50
    pand xmm0, [mask]
    pcmpeqb xmm0, [mask]
    

    Johan said:

    If you're starting with a word, the first unpack will give you a dword, allowing you to shorten it like so:

    movd xmm0, eax
    punpcklbw xmm0, xmm0
    pshufd xmm0, xmm0, 0x00
    pand xmm0, [mask]
    pcmpeqb xmm0, [mask]
    

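    However, this shortened code should not work. Example: assume the input is 0x00FF (word), that is, we want the low 8 bytes set. Register contents are listed from the most significant byte to the least significant.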

    punpcklbw xmm0, xmm0    ; 00 00 00 00 00 00 00 00 00 00 00 00 00 00 FF FF
    pshufd xmm0, xmm0, 0x00 ; 00 00 FF FF 00 00 FF FF 00 00 FF FF 00 00 FF FF
    pand xmm0, [mask]       ; 00 00 20 10 00 00 02 01 00 00 20 10 00 00 02 01
    pcmpeqb xmm0, [mask]    ; 00 00 FF FF 00 00 FF FF 00 00 FF FF 00 00 FF FF
    

    This is the wrong result because we wanted 00 00 00 00 00 00 00 00 FF FF FF FF FF FF FF FF. Sure, it does give you 8 set bytes, just not the 8 which correspond to the bits.
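
Both sequences can be checked without running any assembly. The following is a rough Python model, offered as a sketch rather than a reference implementation: registers are lists of 16 bytes with index 0 as the least significant byte, all function names are hypothetical, and `[mask]` is assumed to hold 0x8040201008040201 in each 64-bit half, as in the question's MMX code.

```python
# Assumed contents of [mask]: 0x8040201008040201 in each qword, low byte first.
MASK = [1 << (i % 8) for i in range(16)]

def movd(eax):
    """movd xmm0, eax: low 4 bytes come from eax, upper 12 bytes are zeroed."""
    return [(eax >> (8 * i)) & 0xFF for i in range(4)] + [0] * 12

def punpcklbw(a, b):
    """Interleave the low 8 bytes of a and b: a0 b0 a1 b1 ..."""
    out = []
    for i in range(8):
        out += [a[i], b[i]]
    return out

def pshufd(src, imm):
    """Pick four dwords of src, one per 2-bit field of imm (lowest field first)."""
    out = []
    for i in range(4):
        sel = (imm >> (2 * i)) & 3
        out += src[4 * sel:4 * sel + 4]
    return out

def mask_compare(x):
    """pand with MASK followed by pcmpeqb against MASK."""
    return [0xFF if (v & m) == m else 0x00 for v, m in zip(x, MASK)]

def unpack_full(bits):
    """The six-instruction sequence: two unpacks, then pshufd 0x50."""
    x = movd(bits)
    x = punpcklbw(x, x)
    x = punpcklbw(x, x)
    return mask_compare(pshufd(x, 0x50))

def unpack_short(bits):
    """The shortened sequence: one unpack, then pshufd 0x00."""
    x = movd(bits)
    x = punpcklbw(x, x)
    return mask_compare(pshufd(x, 0x00))

def expected(bits):
    """Byte i is 0xFF exactly when bit i of the input is set."""
    return [0xFF if (bits >> i) & 1 else 0x00 for i in range(16)]

# The full sequence matches the expected layout for every 16-bit input.
assert all(unpack_full(b) == expected(b) for b in range(1 << 16))

# The shortened sequence fails on the 0x00FF example.
print(unpack_short(0x00FF) == expected(0x00FF))  # prints False
```

In this model the second unpack is what makes each source byte fill a whole dword, so that pshufd 0x50 can hand the low byte to the low eight lanes and the high byte to the high eight.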