Tags: c, assembly, x86, intrinsics, avx

What is the correct intrinsic sequence to do PSRLDQ to an XMM register while keeping the YMM part unchanged?


Assuming xmm0 is the first argument, this is the kind of code I want to produce.

psrldq xmm0, 1
vpermq ymm0, ymm0, 4eh
ret

I wrote this in intrinsics.

__m256i f_alias(__m256i p) {
    *(__m128i *)&p = _mm_bsrli_si128(*(__m128i *)&p, 1);
    return _mm256_permute4x64_epi64(p, 0x4e);
}

This is the result from clang, and it is okay.

f_alias: #clang
        vpsrldq xmm1, xmm0, 1
        vperm2i128      ymm0, ymm0, ymm1, 33
        ret

But gcc produces bad code.

f_alias: #gcc
        push    rbp
        vpsrldq xmm2, xmm0, 1
        mov     rbp, rsp
        and     rsp, -32
        vmovdqa YMMWORD PTR [rsp-32], ymm0
        vmovdqa XMMWORD PTR [rsp-32], xmm2
        vpermq  ymm0, YMMWORD PTR [rsp-32], 78
        leave
        ret

I tried a different version.

__m256i f_insert(__m256i p) {
    __m128i xp = _mm256_castsi256_si128(p);
    xp = _mm_bsrli_si128(xp, 1);
    p = _mm256_inserti128_si256(p, xp, 0);
    return _mm256_permute4x64_epi64(p, 0x4e);
}

clang produces the same code.

f_insert: #clang
        vpsrldq xmm1, xmm0, 1
        vperm2i128      ymm0, ymm0, ymm1, 33
        ret

But gcc is too literal in translating the intrinsics.

f_insert: #gcc
        vpsrldq xmm1, xmm0, 1
        vinserti128     ymm0, ymm0, xmm1, 0x0
        vpermq  ymm0, ymm0, 78
        ret

What is a good way to write this operation in intrinsics? I'd like to make gcc produce good code like clang if possible.

Some side questions.

  1. Is it bad to mix PSRLDQ with AVX code? Is it better to use VPSRLDQ like clang did? If nothing is wrong with using PSRLDQ, it seems like the simpler approach because it doesn't zero the YMM part like the VEX version does.
  2. What is the purpose of having both F and I instructions which seem to do the same job anyway, for example VINSERTI128/VINSERTF128 or VPERM2I128/VPERM2F128?

Solution

  • Optimal asm on Skylake would use legacy SSE psrldq xmm0, 1, with the effect of leaving the rest of the vector unchanged handled as a data dependency (on a register the instruction reads anyway, since this isn't movdqa or something). But that would be disastrous on Haswell or on Ice Lake, both of which have a costly transition to a "saved uppers" state when a legacy-SSE instruction writes an XMM register while any YMM register has a dirty upper half. I'm unsure how Zen1 or Zen2/3/4... handle it.
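
    For illustration only (my addition, not part of the original answer): a legacy-SSE merge could be forced from C with GCC-style inline asm; the function name is made up, and this is only plausible on a Skylake-like CPU with no SSE/AVX transition penalty. On Haswell or Ice Lake it creates exactly the dirty-upper situation described above.

    // Sketch, not a recommendation: non-VEX psrldq leaves bits 255:128 of the
    // register unchanged, so the hardware itself preserves the upper half of v.
    __m256i rshift_lowhalf_legacy_sse(__m256i v)   // hypothetical name
    {
        // AT&T syntax; the %x0 modifier prints the XMM name of v's YMM register.
        asm("psrldq $1, %x0" : "+x"(v));
        return v;
    }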


    Nearly as good on Skylake, and optimal everywhere else, is to copy-and-shift, then vpblendd to merge in the original high half, since you don't need to move any data between 128-bit lanes. (The _mm256_permute4x64_epi64(p, 0x4e) in your version is a lane swap separate from the operation you asked about in the title. If that's something you also want, keep using vperm2i128 to merge as part of that lane swap, as sketched below. If not, it's a bug.)
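
    A minimal sketch of that combined version (my illustration, not from the original answer; the function name is made up): shift the low half, then let one vperm2i128 do both the merge and the lane swap, which is essentially what clang emitted above.

    __m256i rshift_low_then_swap_lanes(__m256i p)
    {
        __m128i low = _mm_bsrli_si128(_mm256_castsi256_si128(p), 1);
        // imm 0x21: dest low lane = p's high lane, dest high lane = low lane of
        // the second source (the shifted copy) -> same as psrldq + vpermq 0x4e.
        return _mm256_permute2x128_si256(p, _mm256_castsi128_si256(low), 0x21);
    }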

    vpblendd is more efficient than any shuffle, able to run on any one of multiple execution ports, with 1 cycle latency on Intel CPUs. (Lane-crossing shuffles like vperm2i128 are 1 uop / 3 cycle latency on mainstream Intel, and significantly worse on AMD and on Alder Lake's E-cores; see https://uops.info/.) By contrast, variable blends with a vector control are often more expensive, but immediate blends are very good.
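
    To make the immediate-vs-variable distinction concrete, a hedged side-by-side (my example, not the answer's; the function names are made up). Both merge the low 128 bits of b into a, but the first compiles to vpblendd with an immediate, while the second needs a vector control for vpblendvb.

    __m256i merge_low_half_imm(__m256i a, __m256i b)
    {
        return _mm256_blend_epi32(a, b, 0x0F);           // immediate blend: cheap everywhere
    }

    __m256i merge_low_half_variable(__m256i a, __m256i b)
    {
        __m256i mask = _mm256_setr_epi64x(-1, -1, 0, 0); // take b where mask byte MSBs are set
        return _mm256_blendv_epi8(a, b, mask);           // variable blend: often more uops / latency
    }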

    And yes, it is more efficient on some CPUs to use an XMM (__m128i) shift instead of shifting both halves and then blending with the original. Shifting the whole YMM would take less typing (no cast intrinsics), but if compilers didn't optimize it away you'd be wasting uops on Zen1 and on Alder Lake E-cores, where each half of vpsrldq ymm takes a separate uop.

    __m256i rshift_lowhalf_by_1(__m256i v)
    {
        __m128i low = _mm256_castsi256_si128(v);
        low = _mm_bsrli_si128(low, 1);
        return _mm256_blend_epi32(v, _mm256_castsi128_si256(low), 0x0F);
    }
    

    gcc/clang compile it as written (Godbolt), with xmm byte-shift and YMM vpblendd. (Clang flips the immediate and uses opposite source registers, but same difference.)

    vpblendd is 2 uops on Zen1, because it has to process both halves of the vector. The decoders don't look at the immediate for special cases like keeping a whole half of the vector. And it can still copy to a separate destination, not necessarily overwriting either source in-place. For a similar reason, vinserti128 is also 2 uops, unfortunately. (vextracti128 is only 1 uop on Zen1; I was hoping vinserti128 was going to be only 1, and wrote the following version before checking uops.info):

    // don't use on any CPU *except* maybe an Alder Lake pinned to an E-core; it's only break-even on Zen1.
    __m256i rshift_alder_lake_e(__m256i v)
    {
        __m128i low = _mm256_castsi256_si128(v);
        low = _mm_bsrli_si128(low, 1);
        return _mm256_inserti128_si256(v, low, 0);  // still 2 uops on Zen1 or Alder Lake, same as vpblendd
        // clang optimizes this to vpblendd even with -march=znver1.  That's good for most uarches, break-even for Zen1, so that's fine.
    }
    

    There may be a small benefit on Alder Lake E-cores, where vinserti128 latency is listed as [1;2] instead of a flat 2 for vpblendd. But since any Alder Lake system will have P-cores as well, you don't actually want to use vinserti128 because it's much worse on everything else.


    What is the purpose of having both VINSERTI128/VPERM2I128 and VINSERTF128/VPERM2F128?

    vinserti128 with a memory source only does a 128-bit load, while vperm2i128 does a 256-bit load which might cross a cache line or page boundary for data you're not even going to use.

    On AVX CPUs where load/store execution units only have 128-bit wide data paths to cache (like Sandy/Ivy Bridge), that's a significant benefit.
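
    As a hedged illustration of the load-width point (my example, not the answer's; names are made up): replacing the high half of a vector from memory only touches 16 bytes with vinserti128, whereas an equivalent vperm2i128 merge has to read a full 32 bytes first.

    __m256i replace_high_half_insert(__m256i v, const __m128i *hi)
    {
        return _mm256_inserti128_si256(v, _mm_loadu_si128(hi), 1);  // folds to a 128-bit memory source
    }

    __m256i replace_high_half_vperm(__m256i v, const __m256i *hi)
    {
        __m256i full = _mm256_loadu_si256(hi);             // 256-bit load, half of it unused
        return _mm256_permute2x128_si256(v, full, 0x20);   // low lane = v.low, high lane = full.low
    }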

    On CPUs where shuffle units are only 128 bits wide (like Zen1, as discussed in this answer), vperm2i128's two full source inputs and arbitrary shuffling make it a lot more expensive (unless, I suppose, you had smarter decoders that emitted a number of uops to move halves of the vector dependent on the immediate).

    e.g. Zen1's vperm2i/f128 is 8 uops, with 2c latency and 3c throughput! (Zen2, with its 256-bit execution units, improves that to 1 uop, 3c latency, 1c throughput.) See https://uops.info/


    What is the purpose of having both F and I instructions which seem to do the same job anyway?

    Same as always (dating back to stuff like SSE1 orps vs. SSE2 pxor / orpd), to let CPUs have different bypass-forwarding domains for SIMD-integer vs. SIMD-FP.

    Shuffle units are expensive so it's normally worth sharing them between FP and integer (and the way Intel does that these days results in no extra latency when you use vperm2f128 between vpaddd instructions).

    But blend, for example, is simple, so there probably are separate FP and integer blend units, and there is a latency penalty for using blendvps between paddd instructions. (See https://agner.org/optimize/)
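
    A small hedged example of picking the matching domain (my illustration, not the answer's; function names are made up): use the integer blend around integer math and the FP blend around FP math, so the data stays within one bypass-forwarding domain.

    __m256i blend_in_integer_domain(__m256i a, __m256i b)
    {
        a = _mm256_add_epi32(a, b);                // vpaddd: SIMD-integer domain
        return _mm256_blend_epi32(a, b, 0xF0);     // vpblendd keeps it in that domain
    }

    __m256 blend_in_fp_domain(__m256 a, __m256 b)
    {
        a = _mm256_add_ps(a, b);                   // vaddps: SIMD-FP domain
        return _mm256_blend_ps(a, b, 0xF0);        // vblendps keeps it in that domain
    }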