C++ SSE2 or AVX2 intrinsics for grayscale to ARGB conversion

I was wondering if there is an SSE2/AVX2 integer instruction or sequence of instructions (or intrinsics) that achieves the following result:

Given a row of 8 byte pixels of the form:

A = {a, b, c, d, e, f, g, h}

Is there any way to load these pixels into a YMM register that holds 8 32-bit ARGB pixels, such that each grayscale value gets broadcast to the other 2 color bytes of its corresponding 32-bit pixel? The result should be something like this (the 0 is the alpha value):

B = {0aaa, 0bbb, 0ccc, 0ddd, 0eee, 0fff, 0ggg, 0hhh}
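For reference, the mapping described above can be written as a plain scalar loop. This is only a sketch to pin down the expected output; the function name gray_to_argb_scalar is made up for illustration, and it assumes the pixel is stored as a 32-bit value with alpha in the top byte.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reference helper (not from the original post):
// expands n grayscale bytes into n pixels of the form 0x00GGGGGG.
void gray_to_argb_scalar(const std::uint8_t* src, std::uint32_t* dst, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        // For g < 256, g * 0x00010101 == g | (g << 8) | (g << 16):
        // the three copies land in separate bytes, so no carries occur.
        dst[i] = src[i] * 0x00010101u;
    }
}
```

A vectorized version can be checked against this loop element by element.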

I'm a complete beginner in vector extensions so I'm not even sure how to approach this, or if it's at all possible.

Any help would be appreciated. Thanks!

UPDATE1

Thanks for your answers. I still have a problem though:

I put this small example together and compiled with VS2015 on x64.

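#include <immintrin.h>  // SSE2/AVX2 intrinsics
#include <malloc.h>     // _aligned_malloc, _aligned_free (MSVC)
#include <string.h>     // memset

int main()
{
    // 64 zeroed bytes, 32-byte aligned; only the first 8 are used as pixels.
    unsigned char* pixels = (unsigned char*)_aligned_malloc(64, 32);
    memset(pixels, 0, 64);

    for (unsigned char i = 0; i < 8; i++)
        pixels[i] = 0xaa + i;

    // Load 16 bytes, then zero-extend the low 8 bytes to 8 dwords.
    __m128i grayscalePix = _mm_load_si128((const __m128i*)pixels);
    __m256i rgba = _mm256_cvtepu8_epi32(grayscalePix);
    __m256i mulOperand = _mm256_set1_epi32(0x00010101);

    // Multiplying by 0x00010101 copies the low byte into bytes 1 and 2.
    __m256i result = _mm256_mullo_epi32(rgba, mulOperand);

    _aligned_free(pixels);
    return 0;
}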

The problem is that after doing

__m256i rgba = _mm256_cvtepu8_epi32(grayscalePix);

rgba only has the first four doublewords set. The last four are all 0.

The Intel developer manual says:

VPMOVZXBD ymm1, xmm2/m64
Zero extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 32-bit integers in ymm1.

I'm not sure if this is intended behaviour or I'm still missing something.

Thanks.

Solution

Update: @chtz's answer is an even better idea, using a cheap 128->256 broadcast load instead of vpmovzx to feed vpshufb, saving shuffle-port bandwidth.
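A minimal sketch of that broadcast-load idea, not copied from @chtz's answer: it handles 16 pixels per call and assumes at least 16 readable source bytes. The function name gray16_to_argb_broadcast and the GRAY_TARGET_AVX2 macro are made up for illustration. The macro only lets GCC/Clang compile the function without -mavx2; on MSVC it expands to nothing.

```cpp
#include <immintrin.h>
#include <cstdint>

#if defined(__GNUC__) || defined(__clang__)
#define GRAY_TARGET_AVX2 __attribute__((target("avx2")))
#else
#define GRAY_TARGET_AVX2
#endif

// Hypothetical helper: expands 16 grayscale bytes at src into
// 16 pixels of the form 0x00GGGGGG at dst (unaligned stores).
GRAY_TARGET_AVX2
void gray16_to_argb_broadcast(const std::uint8_t* src, std::uint32_t* dst)
{
    // Put the same 16 source bytes in both 128-bit lanes. Compilers can
    // usually fold the load into a single VBROADCASTI128 from memory.
    __m256i both = _mm256_broadcastsi128_si256(
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(src)));

    // VPSHUFB works within each lane. Index -128 (high bit set) writes a
    // zero byte, which becomes the alpha byte of each pixel.
    const __m256i pick0to7 = _mm256_setr_epi8(
        0, 0, 0, -128,  1, 1, 1, -128,  2, 2, 2, -128,  3, 3, 3, -128,
        4, 4, 4, -128,  5, 5, 5, -128,  6, 6, 6, -128,  7, 7, 7, -128);
    const __m256i pick8to15 = _mm256_setr_epi8(
        8, 8, 8, -128,    9, 9, 9, -128,    10, 10, 10, -128,  11, 11, 11, -128,
        12, 12, 12, -128, 13, 13, 13, -128, 14, 14, 14, -128,  15, 15, 15, -128);

    __m256i px0to7  = _mm256_shuffle_epi8(both, pick0to7);
    __m256i px8to15 = _mm256_shuffle_epi8(both, pick8to15);

    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst), px0to7);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst + 8), px8to15);
}
```

Because both lanes hold all 16 source bytes, each output vector needs only one shuffle and no lane-crossing instruction.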

Start with PMOVZX like Mark suggests.

But after that, PSHUFB (_mm256_shuffle_epi8) will be much faster than PMULLD, except that it competes for the shuffle port with PMOVZX. (And it operates in-lane, so you still need the PMOVZX).
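A sketch of that PMOVZX + PSHUFB sequence for 8 pixels, assuming pixels are stored as little-endian dwords with alpha in the top byte. The function name gray8_to_argb_pmovzx and the GRAY_TARGET_AVX2 macro are made up for illustration. The macro only lets GCC/Clang compile the function without -mavx2; on MSVC it expands to nothing.

```cpp
#include <immintrin.h>
#include <cstdint>

#if defined(__GNUC__) || defined(__clang__)
#define GRAY_TARGET_AVX2 __attribute__((target("avx2")))
#else
#define GRAY_TARGET_AVX2
#endif

// Hypothetical helper: expands 8 grayscale bytes at src into
// 8 pixels of the form 0x00GGGGGG at dst (unaligned store).
GRAY_TARGET_AVX2
void gray8_to_argb_pmovzx(const std::uint8_t* src, std::uint32_t* dst)
{
    // 8-byte load, then zero-extend each byte to a dword (VPMOVZXBD).
    __m128i gray = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));
    __m256i wide = _mm256_cvtepu8_epi32(gray);

    // Within each 128-bit lane, copy byte 0 of every dword into bytes 1
    // and 2. Index -128 (high bit set) writes zero, giving the alpha byte.
    const __m256i dup = _mm256_setr_epi8(
        0, 0, 0, -128,  4, 4, 4, -128,  8, 8, 8, -128,  12, 12, 12, -128,
        0, 0, 0, -128,  4, 4, 4, -128,  8, 8, 8, -128,  12, 12, 12, -128);

    __m256i argb = _mm256_shuffle_epi8(wide, dup);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst), argb);
}
```

The shuffle mask is identical in both lanes because PMOVZX has already placed pixels 0-3 in the low lane and pixels 4-7 in the high lane.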

So if you only care about throughput, not latency, then _mm256_mullo_epi32 is good. But if latency matters, or if your throughput bottlenecks on something other than 2 shuffle-port instructions per vector anyway, then PSHUFB to duplicate the bytes within each pixel should be best.

Actually, even for throughput, _mm256_mullo_epi32 is bad on HSW and BDW: it's 2 uops (10c latency) that both need p0, so it can only sustain one per 2 clocks.

On SKL, it's 2 uops (10c latency) for p01, so it can sustain the same one-per-clock throughput as VPMOVZXBD. But it's one extra uop, making it more likely to bottleneck.

(VPSHUFB is 1 uop, 1c latency, for port 5, on all Intel CPUs that support AVX2.)