Tags: assembly, x86, cpu-architecture, cpu-registers

If I have an 8-bit value, is there any advantage to using an 8-bit register instead of say, 16, 32, or 64-bit?


The introductory x86 asm literature I read just seems to stick with 32-bit registers (eax, ebx, etc.) in all practical scenarios, except to demonstrate the 64-bit registers as a thing that also exists. If 16-bit registers are mentioned at all, it is as a historical note explaining why the 32-bit registers have an 'e' in front of their names. Compilers seem equally uninterested in less-than-32-bit registers.

Consider the following C code:

int main(void) { return 511; }

Although main purports to return an int, Linux exit status codes are in fact 8-bit, meaning any value over 255 is truncated to its least significant 8 bits, viz.

hc027@HC027:~$ echo "int main(void) { return 511; }" > exit_gcc.c
hc027@HC027:~$ gcc exit_gcc.c 
hc027@HC027:~$ ./a.out 
hc027@HC027:~$ echo $?
255

So we see that only the least significant 8 bits of int main(void)'s return value are used by the system. Yet when we ask GCC for the assembly output of that same program (gcc -S exit_gcc.c), does it store the return value in an 8-bit register? Let's find out!

hc027@HC027:~$ cat exit_gcc.s
    .file   "exit_gcc.c"
    .text
    .globl  main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    $511, %eax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
    .ident  "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609"
    .section    .note.GNU-stack,"",@progbits

Nope! It uses %eax, a very-much-32-bit register! Now, GCC is smarter than me, and maybe the return value of int main(void) is used in other contexts where it won't be truncated to the 8 least significant bits (or maybe the C standard decrees that it must return a for-realsy, actual int no matter what its actual destiny).

But regardless of the efficacy of my specific example, the question stands. As far as I can tell, the registers narrower than 32 bits are pretty much neglected by modern x86 assembly programmers and compilers alike. A cursory Google of "when to use 16-bit registers x86" returns no relevant answers. I'm pretty curious: is there any advantage to using the 8 and 16-bit registers in x86 CPUs?


Solution

  • So, it doesn't really have to be that way; there's a bit of history going on here. Try running

        mov rax, -1 ; 0xFFFFFFFFFFFFFFFF
        mov eax, 0
        print rax
    

    On your favorite x86 desktop (print being whatever your environment/language provides). What you'll notice is that even though rax started out with all ones, and you think you only wiped out the bottom 32 bits, the print statement prints zero! A write to eax zero-extends into the whole of rax, wiping the upper 32 bits. Why? That's awfully weird and unintuitive behavior. The reason is simple: because it's much faster. Maintaining the upper bits of rax would be an absolute pain when you keep writing to eax.

    Intel, however, didn't realize this back when they originally moved to 32 bits, and made a fatal error that forever left al/ah as nothing but historical relics: when you write to al or ah, the other doesn't get clobbered! This does make more intuitive sense, and it was once a great idea in the 16-bit era, because it effectively gave you twice as many registers on top of the full 16-bit one. But nowadays, with the move to an abundance of registers, we just don't need more registers anymore. What we really want are faster registers, and to push more GHz. From this point of view, every time you write to al or ah, the processor needs to preserve the other half, which is fundamentally just much more expensive. (Explanation of why, later.)

    Enough with the theory; let's get some real tests. Each test case was run three times, on an Intel Core i5-4278U CPU @ 2.60GHz.

    Only rax: 1.067s, 1.072s, 1.097s

    global _main
    _main:
    mov ecx, 1000000000
    loop:
    test ecx, ecx
    jz exit
    mov rax, 5
    mov rax, 5
    mov rax, 6
    mov rax, 6
    mov rax, 7
    mov rax, 7
    mov rax, 8
    mov rax, 8
    dec ecx
    jmp loop
    exit:
    ret
    

    Only eax: 1.072s, 1.062s, 1.060s

    global _main
    _main:
    mov ecx, 1000000000
    loop:
    test ecx, ecx
    jz exit
    mov eax, 5
    mov eax, 5
    mov eax, 6
    mov eax, 6
    mov eax, 7
    mov eax, 7
    mov eax, 8
    mov eax, 8
    dec ecx
    jmp loop
    exit:
    ret
    

    Only ah: 2.702s, 2.748s, 2.704s

    global _main
    _main:
    mov ecx, 1000000000
    loop:
    test ecx, ecx
    jz exit
    mov ah, 5
    mov ah, 5
    mov ah, 6
    mov ah, 6
    mov ah, 7
    mov ah, 7
    mov ah, 8
    mov ah, 8
    dec ecx
    jmp loop
    exit:
    ret
    

    Only ah/al: 1.432s, 1.457s, 1.427s

    global _main
    _main:
    mov ecx, 1000000000
    loop:
    test ecx, ecx
    jz exit
    mov ah, 5
    mov al, 5
    mov ah, 6
    mov al, 6
    mov ah, 7
    mov al, 7
    mov ah, 8
    mov al, 8
    dec ecx
    jmp loop
    exit:
    ret
    

    ah and al, then eax: 1.117s, 1.084s, 1.082s

    global _main
    _main:
    mov ecx, 1000000000
    loop:
    test ecx, ecx
    jz exit
    mov ah, 5
    mov al, 5
    mov eax, 6
    mov al, 6
    mov ah, 7
    mov eax, 7
    mov ah, 8
    mov al, 8
    dec ecx
    jmp loop
    exit:
    ret
    

    (Note that these tests aren't measuring partial-register stalls, since I never read eax after the writes to ah. This is in reference to the comments on the main post.)

    As you can see from the tests, using al/ah is much slower. eax/rax blow the other times out of the water, and there is fundamentally no performance difference between rax and eax themselves. As discussed, the reason is that writes to eax/rax overwrite the entire register, whereas using ah or al means the other half must be maintained.

    See also "How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent" for some more experiments and conclusions about what Intel CPUs might be doing internally, and some performance experiments on RMWs vs. write-only access to full and partial registers.


    Now, if you wish, we can delve into the explanation of why it's more efficient to just wipe the register on every write. At face value it doesn't seem like it should matter: just update the bits that matter, right? What's the big deal?

    Well, modern CPUs are intelligent: they will very aggressively parallelize operations that they know can't interfere with each other, but only when such parallelization is actually possible. For example, if you mov eax to ebx, then ebx to ecx, then ecx to edx, the CPU cannot parallelize that chain, and it will run slower than usual. However, if you write to eax, write to ebx, write to ecx, and write to edx, the CPU can parallelize all of those operations, and it will run much faster than usual! Feel free to test this on your own. (Update: recent CPUs with sophisticated mov-elimination can handle a chain of dependent movs with zero latency. Intel Ivy Bridge, the first to introduce mov-elimination, couldn't always eliminate all of the movs in a chain like that, but this Zen 4 article shows Ice Lake, Zen 2, and later all handling dependent mov at the same throughput as independent mov reg,reg.)

    Internally, the way this is implemented is by immediately starting to execute and calculate an instruction, even if earlier instructions are still in the midst of being executed. However, the primary restriction is the following:

    • If an earlier instruction writes to some register A, and the current instruction reads from register A, then the current instruction must wait until the earlier instruction has been completed in its entirety, which is what causes these kinds of slowdowns.

    In our mov eax, 5 spam test, which took ~1 second, the CPU could aggressively run all of the operations in parallel, because none of the instructions read from anything; they were all write-only. It only needs to ensure that the most recent write is the value the register holds during any future reads (which is easy: even though the operations all occur in overlapping time periods, the one that started last will also finish last).

    In the mov ah, 5 spam test, it was a painful ~2.5x slower than the mov eax, 5 spam test, because there's fundamentally no easy way to parallelize the operations. Each operation is marked as "reading from eax", since it depends on the previous value of eax, and also as "writing to eax", because it modifies eax. If an operation must read from eax, it must occur after the previous operation has finished writing to eax. Thus, parallelization suffers dramatically.

    Also, if you want to try on your own, you'll notice that add eax, 5 spamming and add ah, 5 spamming both take exactly the same amount of time (2.7s on my CPU, exactly the same as mov ah, 5!). In this case, add eax, 5 is marked as "read from eax", and as "write to eax", so it receives exactly the same slowdown as mov ah, 5, which must also both read and write to eax! The actual mov vs add doesn't matter, the logic gates will immediately connect the input to the output via the desired operation in a single tick of the ALU.

    So, I hope that shows why eax's full-64-bit-overwrite behavior leads to times that are faster than ah's preservation system.


    There are a couple more details here, though: why did the ah/al swap test take a much faster 1.43 seconds? Well, most likely register renaming is helping with all of the "mov ah, 5; mov al, 5" writes. It looks like the CPU was intelligent enough to split "ah" and "al" into their own full 64-bit registers, since they use different parts of the "eax" register anyway. This allows each consecutive pair of ah-then-al operations to run in parallel, saving significant time. If "eax" is ever read in its entirety, the CPU would need to coalesce the two "al"/"ah" registers back into one register, causing a significant slowdown (shown later). In the earlier "mov ah, 5"-only test, it wasn't possible to split eax into separate registers, because we used "ah" every single time anyway.

    And, interestingly, if you look at the ah/al/eax test, you can see that it was almost as fast as the eax test! In this case, I'm predicting that all three got their own registers and the code was thus extremely parallelized.

    Of course, as mentioned, reading eax anywhere in that loop is going to kill performance, since ah/al will have to be coalesced. Here's an example:

    Times: 3.412s, 3.390s, 3.515s

    global _main
    _main:
    mov ecx, 1000000000
    loop:
    test ecx, ecx
    jz exit
    mov ah, 5
    mov al, 5
    xor eax, 5
    mov al, 6
    mov ah, 8
    xor eax, 5
    mov al, 8
    dec ecx
    jmp loop
    exit:
    ret
    

    But note that the above test doesn't have a proper control group, as it uses xor instead of mov (e.g., what if merely using "xor" is the reason it's slow?). So, here's a test to compare it to:

    Times: 1.426s, 1.424s, 1.392s

    global _main
    _main:
    mov ecx, 1000000000
    loop:
    test ecx, ecx
    jz exit
    mov ah, 5
    mov al, 5
    xor ah, 5
    mov al, 6
    mov ah, 8
    xor ah, 5
    mov al, 8
    dec ecx
    jmp loop
    exit:
    ret
    

    The xor eax test coalesces very aggressively, which causes the horrible 3.4 seconds that is in fact far slower than any of the other tests. But the xor ah control still splits al/ah into two different registers and thus runs pretty fast, faster than using only ah, because consecutive ah/al operations can be parallelized. So, that was a trade-off that Intel was willing to make.

    As mentioned, and as seen, it just doesn't really matter whether you use xor vs add vs mov; the ah/al version above still takes 1.4 seconds. Bitwise ops, add, and mov all simply hook the input up to the output with very few logic gates, so the particular operation doesn't matter. (However, mul and div will indeed be slower; they require tougher computation and thus several micro-cycles.)


    The past two tests show the reported partial-register stall, which to be honest I hadn't even considered at first. I first thought register renaming would mitigate the problem, which it appears to do in the ah/al and ah/al/eax mixes. However, reads of eax with dirty ah/al values are brutal, because the processor now has to combine the renamed ah/al registers. It looks like processor manufacturers believed renaming partial registers was still worth it, which makes sense: most work with ah/al doesn't involve reads of eax; you would just read from ah/al if that was your plan. This way, tight loops that bit-fiddle with ah/al benefit greatly, and the only harm is a hiccup on the next use of eax (at which point ah/al are probably not going to be used anymore).

    If Intel had wanted, rather than the ah/al register-renaming optimization giving 1.4 seconds, plain ah taking 2.7 seconds, and register-coalescing abuse taking 3.4 seconds, they could have skipped register renaming entirely, and all of those tests would have been the exact same 2.7 seconds. But Intel is smart: they know there's code out there that wants to use ah and al a lot, while it's not common to find code that uses al and ah a lot and also reads the full eax all the time.

    Overall, even in the case of no partial register stall, writes to ah are still much slower than writes to eax, which is what I was trying to get across.

    Of course, results may vary. Other processors (most likely very old ones) might have control bits to shut off half of the bus, which would allow it to act as a 16-bit or 8-bit bus when needed. Those control bits would have to be wired through logic gates on the input to the registers, which would slightly slow down any and all use of the register, since that's one more gate to pass through before the register can update. Since such control bits would be off the vast majority of the time (it's rare to mess with 8-bit/16-bit values), it looks like Intel decided not to do that (for good reason).