Search code examples
assemblyx86sseintrinsicssse4

Can PTEST be used to test if two registers are both zero or some other condition?


What can you do with SSE4.1 ptest other than testing if a single register is all-zero?

Can you use a combination of SF and CF to test anything useful about two unknown input registers?

What is PTEST good for? You'd think it would be good for checking the result of a packed-compare (like PCMPEQD or CMPPS), but at least on Intel CPUs, it costs more uops to compare-and-branch using PTEST + JCC than with PMOVMSK(B/PS/PD) + macro-fused CMP+JCC.

See also Checking if TWO SSE registers are not both zero without destroying them


Solution

  • No, unless I'm missing something clever, ptest with two unknown registers is generally not useful for checking some property about both of them. (Other than obvious stuff you'd already want a bitwise-AND for, like intersection between two bitmaps).

    To test two registers for both being all-zero, OR them together and PTEST that against itself.


    ptest xmm0, xmm1 produces two results:

    • ZF = is xmm0 & xmm1 all-zero?
    • CF = is (~xmm0) & xmm1 all-zero? (1 if all the set bits in XMM1 are set in XMM0)

    If the second vector is all-zero, the flags don't depend at all on the bits in the first vector.

    It may be useful to think of the "is-all-zero" checks as a NOT(bitwise horizontal-OR()) of the AND and ANDNOT results. But probably not, because that's too many steps for my brain to think through easily. That sequence of vertical-AND and then horizontal-OR does maybe make it easier to understand why PTEST doesn't tell you much about a combination of two unknown registers, just like the integer TEST instruction.

    Here's a truth table for a 2-bit ptest a,mask. Hopefully this helps in thinking about mixes of zeros and ones with 128b inputs.

    Note that CF(a,mask) == ZF(~a,mask).

    a    mask     ZF    CF
    00   00       1     1
    01   00       1     1
    10   00       1     1
    11   00       1     1
    
    00   01       1     0
    01   01       0     1
    10   01       1     0
    11   01       0     1
    
    00   10       1     0
    01   10       1     0
    10   10       0     1
    11   10       0     1
    
    00   11       1     0
    01   11       0     0
    10   11       0     0
    11   11       0     1
    

    Intel's intrinsics guide lists 2 interesting intrinsics for it. Note the naming of the args: a and mask are a clue that they tell you about the parts of a selected by a known AND-mask.

    • _mm_test_mix_ones_zeros (__m128i a, __m128i mask): returns (ZF == 0 && CF == 0)
    • _mm_test_all_zeros (__m128i a, __m128i mask): returns ZF

    There's also the more simply-named versions:

    • int _mm_testc_si128 (__m128i a, __m128i b): returns CF. (As Microsoft docs helpfully point out, this is 1 if all the bits set in b are set in a; otherwise 0.)
    • int _mm_testnzc_si128 (__m128i a, __m128i b): returns (ZF == 0 && CF == 0)
    • int _mm_testz_si128 (__m128i a, __m128i b): returns ZF (The intersection is zero.)

    There are AVX2 __m256i versions of those intrinsics, but the guide only lists the all_zeros and mix_ones_zeros alternate-name versions for __m128i operands.

    If you want to test some other condition from C or C++, you should use testc and testz with the same operands, and hope that your compiler realizes that it only needs to do one PTEST, and hopefully even use a single JCC, SETCC, or CMOVCC to implement your logic. (I'd recommend checking the asm, at least for the compiler you care about most.)


    Note that _mm_testz_si128(v, set1(0xff)) is always the same as _mm_testz_si128(v,v), because that's how AND works. But that's not true for the CF result.

    You can check for a vector being all-ones using

    bool is_all_ones = _mm_testc_si128(v, _mm_set1_epi8(0xff));
    

    This is probably no faster (but smaller code-size) than a _mm_cmpeq_epi8 against a vector of all-ones, with the usual _mm_movemask_epi8() == 0xffff, at least if you're branching on it so the scalar cmp is free, fusing with the jcc, so both ways are 3 total uops including the (cmp)/jcc. Except on Zen 1 & 2 where ptest is only 1 uop instead of 2 on other CPUs, then it has an advantage: https://uops.info/. It doesn't avoid the need for a vector constant in this case.

    PTEST does have the advantage that it doesn't destroy either input operand, even without AVX.