So I'm currently studying bit-wise operators and bit-manipulation, and I have come across two different ways to combine four 1-byte words into one 4-byte wide word.
the two ways are given below
After finding out this two methods I compare the disassembly code generated by the two (compiled using gcc 11 with -O2 flag), I don't have the basic knowledge with disassembly and with the code it generates, and what I only know is the shorter the code, the faster the function is (most of the time I guess... maybe there are some exceptions), now for the two methods it seems that they have the same numbers/counts of lines in the generated disassembly code, so I guess they have the same performance?
I also got curious about the order of the instructions, the first method seems to alternate other instructions sal>or>sal>or>sal>or
, while the second one is more uniform sal>sal>sal>or>or>mov>or
does this have some significant effect in the performance say for example if we are dealing with a larger word?
Two methods
int method1(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
int combine = 0;
combine = byte4;
combine <<=8;
combine |= byte3;
combine <<=8;
combine |= byte2;
combine <<=8;
combine |= byte1;
return combine;
}
int method2(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
int combine = 0, temp;
temp = byte4;
temp <<= 24;
combine |= temp;
temp = byte3;
temp <<= 16;
combine |= temp;
temp = byte2;
temp <<= 8;
combine |= temp;
temp = byte1;
combine |= temp;
return combine;
}
Disassembly
// method1(unsigned char, unsigned char, unsigned char, unsigned char):
movzx edi, dil
movzx esi, sil
movzx edx, dl
movzx eax, cl
sal edi, 8
or esi, edi
sal esi, 8
or edx, esi
sal edx, 8
or eax, edx
ret
// method2(unsigned char, unsigned char, unsigned char, unsigned char):
movzx edx, dl
movzx ecx, cl
movzx esi, sil
sal edi, 24
sal edx, 8
sal esi, 16
or edx, ecx
or edx, esi
mov eax, edx
or eax, edi
ret
This might be "premature optimization", but I just want to know if there is a difference.
now for the two methods it seems that they have the same numbers/counts of lines in the generated disassembly code, so I guess they have the same performance?
That hasn't been true for decades. Modern CPUs can execute instructions in parallel if they're independent of each other. See
In your case, method 2 is clearly better (with GCC11 -O2 specifically) because the 3 shifts can happen in parallel, leaving only a chain of or
instructions. (Most modern x86 CPUs only have 2 shift units, but the 3rd shift can be happening in parallel with the first OR).
Your first version has one long dependency chain of shift/or/shift/or (after the movzx
zero-extension), so it has the same throughput but worse latency. If it's not on the critical path of some larger computation, performance would be similar.
The first version also has a redundant movzx edi, dil
, because GCC11 -O2 doesn't realize that the high bits will eventually get shifted out the top of the register by the 3x8 = 24 bits of shifting. Unfortunately GCC chose to movzx
into the same register (RDI into RDI), not for example movzx eax, dil
which would let mov-elimination work.
The second one has a wasted mov eax, edx
because GCC didn't realize it should do one of the movzx
operations into EAX, instead of zero-extending each reg into itself (defeating mov-elimination). It also could have used lea eax, [edx + edi]
to merge into a different reg, because it could have proved that those values couldn't have any overlapping bit-positions, so |
and +
would be equivalent.
That wasted mov
generally only happens in small functions; apparently GCC's register allocator has a hard time when values need to be in specific hard registers. If it had its choice of where to produce the value, it would just end up with it in EDX.
So on Intel CPUs, yes by coincidence of different missed optimizations, both versions have 3 non-eliminated movzx
and one instruction which can benefit from mov-elimination. On AMD CPUs, movzx
is never eliminated, so the movzx eax, cl
doesn't benefit.
Unfortunately, your compiler chooses to do all three or
operations in sequence, instead of a tree of dependencies. (a|b) | (c|d)
would be lower worst-case latency than (((a|b) | c) | d)
, critical path length of 2 from all inputs vs. 3 from d
, 2 from c
, 1 from a
or b
. (Writing it in C those different ways doesn't actually change how compilers make asm, because they knows that OR is associative. I'm using that familiar syntax to represent to dependency pattern of the assembly.)
So if all four inputs were ready at the same time, combining pairs would be lower latency, although it's impossible for most CPUs to produce three shift results in the same cycle.
I was able to hand-hold earlier GCC (GCC5) into making this dependency pattern (Godbolt compiler explorer). I used unsigned
to avoid ISO C undefined behaviour. (GNU C does define the behaviour of left shifts even when a set bit is shifted in or out of the sign bit.)
int method4(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
unsigned combine = (unsigned)byte4 << 24;
combine |= byte3 << 16;
unsigned combine_low = (byte2 << 8) | byte1;
combine |= combine_low;
return combine;
}
# GCC5.5 -O3
method4(unsigned char, unsigned char, unsigned char, unsigned char):
movzx eax, sil
sal edi, 24
movzx ecx, cl
sal eax, 16
or edi, eax # (byte4<<24)|(byte3<<16)
movzx eax, dl
sal eax, 8
or eax, ecx # (byte2<<8) | (byte1)
or eax, edi # combine halves
ret
But GCC11.2 makes the same asm for this vs. method2. That would be good if it was optimal, but it's not.
the first method seems to alternate other instructions sal>or>sal>or>sal>or, while the second one is more uniform sal>sal>sal>or>or>mov>or
The dependency chains are the key factor. Mainstream x86 has been out-of-order exec for over 2 decades, and there haven't been any in-order exec x86 CPUs sold for years. So instruction scheduling (ordering of independent instructions) generally isn't a big deal over very small distances like a few instructions. Of course, in the alternating shl/or version, they're not independent so you couldn't reorder them without breaking it or rewriting it.
This part is only relevant if you're a compiler / JIT developer trying to get a compiler to do a better job for source like this. I'm not sure there's a clear win here, although maybe yes if we can't inline this function so the movzx
instructions are actually needed.
We can certainly save instructions, but even modern Intel CPUs still have partial-register merging penalties for high-8-bit registers like AH. And it seems the AH-merging uop can only issue in a cycle by itself, so it effectively costs at least 4 instructions of front-end bandwidth.
movzx eax, dl
mov ah, cl
shl eax, 16 # partial register merging when reading EAX
mov ah, sil # oops, not encodeable, needs a REX for SIL which means it can't use AH
mov al, dil
Or maybe this, which avoids partial-register stalls and false dependencies on Intel Haswell and later. (And also on uarches that don't rename partial regs at all, like all AMD, and Intel Silvermont-family including the E-cores on Alder Lake.)
# good on Haswell / AMD
# partial-register stalls on Intel P6 family (Nehalem and earlier)
merge4(high=EDI, mid_hi=ESI, mid_lo=EDX, low=ECX):
mov eax, edi # mov-elimination possible on IvB and later, also AMD Zen
# zero-extension not needed because more shifts are coming
shl eax, 8
shl edx, 8
mov al, sil # AX = incoming DIL:SIL
mov dl, cl # DX = incoming DL:CL
shl eax, 16
mov ax, dx # EAX = incoming DIL:SIL:DL:CL
ret
This is using 8-bit and 16-bit mov
as an ALU merge operation pretty much like movzx + or
, i.e. a bitfield insert into the low 8 or low 16. I avoided ever moving into AH or other high-8 registers, so there's no partial-register merging on Haswell or later.
This is only 7 total instructions (not counting ret
), all of them single-uop. And one of them is just a mov
which can often be optimized away when inlining, because the compiler will have its choice of which registers to have the value in. (Unless the original value of the high byte alone is still needed in a register). It will often know it already has a value zero-extended in a register after inlining, but this version doesn't depend on that.
Of course, if you were eventually storing this value to memory, doing 4 byte stores would likely be good, especially if it's not about to be reloaded. (Store-forwarding stall.)
Related:
mov
to copy an incoming uint8_t
. (Although that's saying why it's fine even on CPUs that do rename DIL separate from the full RDI, because callers will have written the full register.)Semi-related: