Why does inverting the parameters to a CMPGT comparison function work as a CMPLT?

I'm working with AVX2 in the process of optimizing a small mathematics library for a project, however, I've stumbled into minor inconsistencies.

AVX2 lacks the support for a CMPLT function for packed 32b i's, which is unfortunate. An alternative measure to create such an operation resides in the _mm256_cmpgt_epi32 function using a practice called "property transposition," apparently.

Reference:

_mm256_cmpgt_epi32(a,b)

Expectedly, the computation of a > b would be the desired input, however, to achieve our nonexistent _mm256_cmplt_epi32() function, we'd need to write out the inverse of our original input like so: _mm256_cmpgt_epi32(b,a).

My question being: Why does inerverting the input change the functions' operation to one which should have already been provided ( that being _mm256_cmplt_epi32 )?

Solution

a<b (a less than b) is mathematically the same as b>a (b greater than a). The visual indication of which side the symbol is opening towards has the same meaning whether it's left or right, but < is the less-than operator while > is the greater-than operator.

i.e. to reverse the operands of a comparison, mirror the operator. If this isn't obvious, try it for a few small-integer cases like 1 > 2 vs. 2 < 1, and with equal operands, to see that it works for all 3 possible cases of relations between two integers (greater, equal, or less).

Integer comparison doesn't have any 2's complement wrap-around corner cases that would make that mathematical identity not work. Contrast this with FP cmppd xmm,xmm/mem, imm8, because floating-point has a 4th possibility (unordered)¹.

Intrinsics already could provide both lt and gt integer comparisons (but not ge or le) in terms of the pcmpgt hardware instruction.

One reason not to do so is as a reminder that only the right-hand operand can fold a load into a memory source. And (without AVX) the left hand side will need to be copied if it's also needed after the compare. Or just to have fewer intrinsics in the list, or fewer that don't have an exactly corresponding asm instruction.

Those are reasons it might have made sense to provide more comparison asm instructions, but for whatever reason, x86 didn't until after AVX2. AVX-512 eventually did provide comparison instructions that take a predicate as an immediate instead of part of the opcode (like cmpps), giving a choice of predicate. (vpcmpb/w/d/q and vpcmpub/w/d/q also finally providing unsigned integer compares.)

Only providing one or two predicates is not unique to x86: MIPS for example only provides slt / sltu for branchless scalar integer compare-into-register. That means booleanizing a condition into a 0 / 1 integer takes two instructions instead of 1 for some conditions if it takes an xori reg,reg,1 to invert an slt result, or if the condition was == or != (How to do less than or equal in Assembly Language(MIPS)?).

On MIPS, branching is always at most an slt (or sltu) and a bnez or beqz to branch the way you want whether the slt gives you a zero or a one. (Greater than, less than equal, greater than equal in MIPS). Unlike RISC-V, MIPS doesn't have two-register blt by design, so the branch condition could be ready in the first half-cycle of the EX stage on early MIPS hardware.

Anyway, MIPS's RISC design might have influenced the designers of MMX, when x86's initial choice was made to provide only eq and signed-gt integer comparisons. Transistor budgets were much smaller at the time (Pentium-MMX was released in the mid 90s). Intel figured that most hot loops wouldn't speed up a lot from providing <= or unsigned, I guess, so they didn't get around to adding more flexible integer compares until AVX-512.

Footnote 1:
Contrast this with FP compares, where there are 4 possible relations between two values, greater, equal, less, or unordered if either input is NaN.

This is part of the reason SSE did design cmpps/cmppd to take the predicate as an immediate, allowing a wider choice. But before AVX, even FP compares were limited to the first 7 predicates, ==, <, <=, unordered, !=, !<, !<=, and "ordered". The manual has a truth table of results for each predicate for each of the 4 possible relations between the inputs, and says

The greater-than relations that the processor does not implement require more than one instruction to emulate in software and therefore should not be implemented as pseudo-ops. (For these, the programmer should reverse the operands of the corresponding less than relations and use move instructions to ensure that the mask is moved to the correct destination register and that the source operand is left intact.)

Even for FP, a > b is exactly the same condition as b < a.

SSE was designed later than MMX when transistor budgets were larger, and it was being decoded by P6 not P5. (Into 2 uops on the first-gen SSE implementation, for the non-scalar versions). Possibly also not needing to be 1 cycle latency made room for a compare and then apply predicate. Some / all of these reasons might have been part of the choice to make FP compares somewhat more flexible by spending an extra byte of code-size on an immediate predicate. Perhaps also experience with cases where MMX only having eq and gt compares was inconvenient in practice.