
What is the rationale behind the CPU instruction set, in general and for x86's SETCC?


While examining the instruction set for Intel x86 processors, I noticed there are 'intuitive' instructions like 'mov', 'add', 'mul' ... while others seem a bit unnatural, like 'sete'. The question is more out of curiosity than practical concern: why would designers choose to implement particular execution scenarios as single instructions? Do you know of any reading material that explains such design decisions?


Solution

  • There are at least two possible instruction sequences that achieve this result. Here's my analysis of them:

    ; "classic" code
     xor eax,eax     ; zero AL in case we don't write it.
     cmp edx,15
     jne past
      mov al,1        ; AL = (edx==15)
    past:
    

    And

    ; "evolved" code
    ; xor eax,eax optional to get a bool zero-extended to full register
    cmp edx,15
    sete al        ; AL = 0 or 1 according to edx==15
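
    A side note on that optional xor: it has to come before the cmp, because xor itself writes FLAGS. If pre-zeroing isn't convenient, one alternative (a sketch, not taken from the original answer) is to zero-extend the 0/1 byte afterwards:

    ; sketch: zero-extend after the fact instead of pre-zeroing
    cmp edx,15
    sete al          ; AL = 0 or 1
    movzx eax,al     ; EAX = 0 or 1; EAX did not need to be zeroed beforehand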
    

    It is a bit of a mindset that a conditional, simple assignment "should" involve a conditional jump on the opposite condition, even if it only jumps around a single instruction. But it is - in your own words - a particular execution scenario that occurs frequently, so if a better alternative is available, why not use it?

    When code executes, there are many factors affecting execution speed. Two of them are the time it takes for the result of a comparison/arithmetic/boolean operation to reach the flags register, and the execution penalty incurred when a jump is taken (I'm over-simplifying this a bit).

    So the classic code will either execute a move or take a jump. The former will probably be executed in parallel with other code, while the latter may cause the prefetcher to fetch from the new position, resulting in wait states. The processor's branch prediction may come into play and may - depending on a lot of factors - predict incorrectly, which incurs additional penalties.
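    To make the branch-prediction point concrete, here is a hypothetical sketch (the register assignments are my own assumption) of an inner loop that counts how many dword elements of an array equal 15. In the classic form the jne outcome follows the data, so unpredictable data can mispredict often; in the sete form only the easily predicted loop branch remains:

    ; branchy version: esi = array pointer, edx = element count, ecx = result
    xor ecx,ecx
    count_branchy:
     cmp dword [esi],15
     jne no_match
      inc ecx            ; executed only on a match
    no_match:
     add esi,4
     dec edx
     jnz count_branchy

    ; branchless version using sete
    xor ecx,ecx
    count_sete:
     xor eax,eax         ; keep the upper bytes of EAX zero before the byte write
     cmp dword [esi],15
     sete al             ; EAX = 0 or 1
     add ecx,eax         ; accumulate the 0/1 without a data-dependent jump
     add esi,4
     dec edx
     jnz count_sete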

    In the evolved case, code prefetching is not affected at all, which is good for execution speed. Also, the sete sequence will probably fit in fewer bytes than the mov+jne combo, which means relatively less code cache line capacity is tied up by this code, leaving relatively more cache capacity free for other code and data. If the result of the assignment isn't needed right away, the sete could be rescheduled to a position where it blends in better (execution-wise) with the surrounding code. This rescheduling could be performed explicitly (by the compiler) or implicitly (by the CPU itself). Static scheduling (by the compiler) is limited, because most x86 integer instructions affect FLAGS, so the cmp and the sete can't be separated very far; the sketch below shows what can and can't be placed between them.
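    As a sketch of that scheduling constraint (the surrounding instructions are made up for illustration): instructions that don't write FLAGS, such as mov and lea, can sit between the cmp and the sete, but anything that does write FLAGS would destroy the condition:

    cmp edx,15
    mov ebx,[esi]      ; mov does not write FLAGS, so it can be scheduled here
    lea edi,[edi+4]    ; lea does not write FLAGS either
    ; an add/sub/xor at this point would overwrite FLAGS and break the sete
    sete al            ; still sees the flags produced by the cmp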

    For normal (usually un-tuned), bloated application code, the use of instructions like this will have little impact on overall performance. In highly specialized, hand-tuned code with very tight loops, the difference between executing within three rather than four or five cache lines can make an enormous difference, especially if multiple copies of the code are running on different cores.
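    For a rough sense of the code-size difference mentioned above, here are approximate encodings in 32-bit mode (assuming the short rel8 form of the jump; exact sizes can vary with operands and mode):

    ; classic sequence
    xor eax,eax    ; 2 bytes
    cmp edx,15     ; 3 bytes (sign-extended imm8 form)
    jne past       ; 2 bytes (rel8)
     mov al,1      ; 2 bytes
    past:          ; classic total: 9 bytes

    ; evolved sequence
    xor eax,eax    ; 2 bytes (optional)
    cmp edx,15     ; 3 bytes
    sete al        ; 3 bytes - evolved total: 8 bytes, or 6 without the xor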