What is the most efficient way to do unsigned 64 bit comparison on SSE2?

PCMPGTQ doesn't exist on SSE2 and doesn't natively work on unsigned integers. Our goal here is to provide backward-compatible solutions for unsigned 64-bit comparisons so we can include them into the WebAssembly SIMD standard.

This is a sister question to this one for ARMv7+NEON: What is the most efficient way to do SIMD unsigned 64 bit comparison (CMHS) on ARMv7 with NEON?

And is related to the questions for signed comparisons variants already answered for SSE2 and Neon:

How to simulate pcmpgtq on sse2?

What is the most efficient way to support CMGT with 64bit signed comparisons on ARMv7a with Neon?

Solution

The carry out from a subtract is an unsigned comparision predicate.

We can capture the carry out using same trick that is used when averaging two numbers. Only here we'll base it on a half-subtractor instead of a half-adder.

__m128i sse2_cmpgt_epu64(__m128i a, __m128i b) {
    b = _mm_xor_si128(b, a); // diff
    a = _mm_and_si128(a, b); // borrow `a & ~b`
    b = _mm_sub_epi64(_mm_srli_epi64(b, 1), a); 
    return _mm_shuffle_epi32(_mm_srai_epi32(b, 31), _MM_SHUFFLE(3,3,1,1));
}

Translated from Hacker's Delight:

static
__m128i sse2_cmpgt_epu64(__m128i a, __m128i b) {
    __m128i r = _mm_andnot_si128(_mm_xor_si128(b, a), _mm_sub_epi64(b, a));
    r = _mm_or_si128(r, _mm_andnot_si128(b, a));
    return _mm_shuffle_epi32(_mm_srai_epi32(r, 31), _MM_SHUFFLE(3,3,1,1));
}

Concept: If mixed "signs" (unsigned MSBs) then return a else return b - a

(MSB(a) ^ MSB(b)) ? a : b - a; // result in MSB

This makes sense:

if a's MSB is set and b's isn't, a is unsigned above (so MSB(a) is our result)
if b's MSB is set and a's isn't, a is unsigned below (so MSB(a) is our result)
if their MSBs are the same, they values are in the same half of the unsigned range so b-a is effectively a 63-bit subtraction. The MSBs will cancel and the MSB of b-a will be equal to the "borrow" output which tells you if a is strictly above b. (Like the CF flag for scalar sub. jb is jc). So MSB(b-a) is our result.

Note that the SIMD andnot/and/or is a bit-blend, but we only care about the MSB. We broadcast it with srai -> shuffle_epi32, discarding the garbage in lower bits. (Or with SSE3, movshdup as described in @Soont's answer.)

It differs from signed compare:

(MSB(a) ^ MSB(b)) ? ~a : b - a; // result in MSB

If signs are mixed then the sign of ~a is also the sign of b, of course.