Imagine the following code:
uint64_t x = 0x81C6E3292A71F955ULL;
uint32_t y = (uint32_t) (x >> 32);
y receives the high 32-bit half of the 64-bit integer. My question is whether there exists any intrinsic function or CPU instruction that does this in a single operation, without doing a move and a shift?
At least Clang (linked in the Try-it-online above) creates two instructions, mov rax, rdi and shr rax, 32, for this, so either Clang doesn't do such an optimization, or there exists no such special instruction.
It would be great if there existed an imaginary single instruction like movhi dst_reg, src_reg.
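For reference, here's a minimal stand-alone function showing the pattern (a sketch; the high32 name is just for illustration), which is what produces that mov + shr pair at -O2:

#include <stdint.h>

uint32_t high32(uint64_t x) {
    return (uint32_t) (x >> 32);   /* clang -O2, x86-64: mov rax, rdi / shr rax, 32 */
}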
If there was a better way to do this bitfield-extraction for an arbitrary uint64_t, compilers would already use it. (At least in theory; compilers do have missed optimizations, and their choices sometimes favour latency even if it costs more uops.)
You only need intrinsics for things that you can't express efficiently in pure C, in ways the compiler can already easily understand. (Or if your compiler is dumb and can't spot the obvious.)
You could maybe imagine cases where the input value comes from a multiply of two 32-bit values; then it might be worthwhile on some CPUs for the compiler to use a widening mul r32 to generate the result in two separate 32-bit registers in the first place, instead of imul r64, r64 + shr reg, 32, if it can easily use EAX/EDX. But other than gcc -mtune=silvermont or other tuning options, you can't make the compiler do it that way.
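As a sketch of that case (mulhi_u32 is a made-up name): if the compiler chose the widening mul r32 form, the high half would already be sitting in EDX with no shift needed. Whether it actually does depends on tuning, as noted above.

#include <stdint.h>

uint32_t mulhi_u32(uint32_t a, uint32_t b) {
    uint64_t prod = (uint64_t) a * b;   /* full 32x32 -> 64-bit product */
    return (uint32_t) (prod >> 32);     /* with mul r32 this is just EDX; with imul r64, r64 it needs a shr */
}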
shr reg, 32 has 1 cycle latency, and can run on more than one execution port on most modern x86 microarchitectures (https://uops.info/). The only thing one might wish for is that it could put the result in a different register, without overwriting the input.
Most modern non-x86 ISAs are RISC-like with 3-operand instructions, so a shift instruction can copy-and-shift, unlike x86 shifts where the compiler needs a mov in addition to shr if it also needs the original 64-bit value later, or (in the case of a tiny function) needs the return value in a different register.
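To illustrate (use_both_halves is a made-up name), here's a case where the original value is still live, so x86 has to mov before shifting while a 3-operand ISA can shift into a fresh register:

#include <stdint.h>

uint64_t use_both_halves(uint64_t x) {
    uint32_t lo = (uint32_t) x;           /* still need the low half of x */
    uint32_t hi = (uint32_t) (x >> 32);   /* x86-64: mov + shr; AArch64: a single lsr into another register */
    return (uint64_t) lo + hi;
}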
And some ISAs have bitfield-extract instructions. PowerPC even has a fun rotate-and-mask instruction (rlwinm), with the mask being a bit-range specified by immediates, and it's a different instruction from a normal shift. Compilers will use it as appropriate - no need for an intrinsic. https://devblogs.microsoft.com/oldnewthing/20180810-00/?p=99465
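For example (a sketch, not taken from the linked article): a shift-and-mask bitfield extract like this is the classic pattern PowerPC compilers turn into a single rlwinm:

#include <stdint.h>

uint32_t middle_bits(uint32_t x) {
    return (x >> 5) & 0xFF;   /* PowerPC: one rlwinm (rotate left by 27, mask to the low 8 bits) */
}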
x86 with BMI2 has rorx rax, rdi, 32 to copy-and-rotate, instead of being stuck shifting within the same register. A function returning uint32_t could/should use that instead of mov+shr, in the stand-alone version that doesn't inline, because the caller already has to ignore high garbage in RAX. (Both x86-64 System V and Windows x64 define the return value as only the register width matching the C type; e.g. returning uint32_t means that the high 32 bits of RAX are not part of the return value, and can hold anything. Usually they're zero because writing a 32-bit register implicitly zero-extends to 64 bits, but something like return bar() where bar returns uint64_t can just leave RAX untouched without having to truncate it; in fact an optimized tailcall is possible.)
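A sketch of that tailcall point (bar and foo are hypothetical names):

#include <stdint.h>

uint64_t bar(void);

uint32_t foo(void) {
    return (uint32_t) bar();   /* can compile to just 'jmp bar'; the high half of RAX is caller-ignored garbage */
}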
There's no intrinsic for rorx; compilers are just supposed to know when to use it. (But gcc/clang -O3 -march=haswell miss this optimization.) https://godbolt.org/z/ozjhcc8Te
If a compiler were doing this in a loop, it could keep 32 in a register for shrx reg,reg,reg as a copy-and-shift. Or, more silly, it could use pext with 0xffffffffULL << 32 as the mask, but that's strictly worse than shrx because of the higher latency.
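A sketch of the loop case (hypothetical function; compilers won't necessarily do this): with BMI2 available, the shift count 32 could live in a register across iterations and shrx would act as a copy-and-shift, avoiding a mov per element.

#include <stddef.h>
#include <stdint.h>

void high_halves(uint32_t *dst, const uint64_t *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint32_t) (src[i] >> 32);   /* with -mbmi2, shrx reg, src, reg_holding_32 needs no extra mov */
}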
AMD TBM (Bulldozer-family only, not Zen) had an immediate form of bextr (bitfield-extract), and it ran efficiently as 1 uop (https://agner.org/optimize/). https://godbolt.org/z/bn3rfxzch shows gcc11 -O3 -march=bdver4 (Excavator) using bextr rax, rdi, 0x2020, while clang misses that optimization. gcc -march=znver1 uses mov + shr because Zen dropped Trailing Bit Manipulation along with the XOP extension.
Standard BMI1 bextr needs the position/length in a register, and on Intel CPUs it's 2 uops, so it's garbage for this. It does have an intrinsic, but I recommend not using it; mov + shr is faster on Intel CPUs.
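For completeness, the intrinsic usage would look like this (_bextr_u64 is the GCC/clang spelling in immintrin.h; needs -mbmi, and again, not recommended here):

#include <stdint.h>
#include <immintrin.h>

uint32_t high32_bextr(uint64_t x) {
    return (uint32_t) _bextr_u64(x, 32, 32);   /* start = 32, len = 32; 2 uops on Intel CPUs */
}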