I am looking to optimize a piece of code that uses popcnt to compute differences between uint8_t values. I figure it would be faster to combine 8 uint8_t values into a single uintmax_t and use popcnt64 instead, so that the popcnt operation isn't called 8 times as often as necessary. What is the fastest way to feed 8 uint8_t values into popcnt64? Can I use some kind of casting? Should I use bit manipulation? I'm not familiar with C++'s internal workings, so I'm not sure what the fastest way to make this conversion is.

Assuming you don't care about endianness (you just want to treat the uint8_t values as a uint64_t, and their order doesn't matter, which is the case for popcnt since the bit count doesn't depend on byte order), you can just use std::memcpy to do the type punning:
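#include <array>
#include <cstdint>
#include <cstring>
#include <type_traits>

std::uint64_t combine(std::array<std::uint8_t, 8> b) {
    // Both types must be the same size and trivially copyable for memcpy punning.
    static_assert(sizeof(b) == sizeof(std::uint64_t));
    static_assert(std::is_trivially_copyable_v<std::uint64_t>);
    static_assert(std::is_trivially_copyable_v<decltype(b)>);
    std::uint64_t result;
    std::memcpy(&result, b.data(), sizeof(result));
    return result;
}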
With optimizations enabled, the generated x86-64 assembly just returns the argument:
    combine(std::array<unsigned char, 8ul>): # @combine(std::array<unsigned char, 8ul>)
        mov rax, rdi
        ret
Any other form of type punning, such as casting the pointer with reinterpret_cast, means you have to worry about strict aliasing rules and the alignment of the types involved. It's easier to just use std::memcpy and let the compiler deal with it.
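If the bytes live in a longer buffer rather than in a std::array of exactly 8, the same memcpy idea should carry over. Here is a minimal sketch, assuming a zero-padded tail is acceptable for your use; pack_words is a hypothetical helper name, not something from the question:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: packs a byte buffer into 64-bit words, 8 bytes per word.
// The last word is zero-padded when size is not a multiple of 8, so the padding
// adds no set bits to a later bit count.
inline std::vector<std::uint64_t> pack_words(const std::uint8_t* data, std::size_t size) {
    std::vector<std::uint64_t> words((size + 7) / 8, 0);  // zero-initialized
    if (size != 0) {
        // A single memcpy for the whole buffer; `data` needs no particular alignment.
        std::memcpy(words.data(), data, size);
    }
    return words;
}
```

The resulting words have the same byte layout as the input, so which byte ends up in which bit position still depends on the platform's endianness. That doesn't matter for counting bits.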
Note that the easiest way of calling any variant of popcnt from C++ is to use std::bitset::count. So instead of the compiler-specific __builtin_popcountll(my_u64) (GCC/Clang) or __popcnt64(my_u64) (MSVC), you could just write std::bitset<64>{my_u64}.count() and you instantly get portable code.
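Putting the two pieces together for the original goal might look roughly like this. It's only a sketch: it assumes "differences" means the number of differing bits (Hamming distance) between two 8-byte blocks, and load_u64 and hamming8 are hypothetical names:

```cpp
#include <array>
#include <bitset>
#include <cstdint>
#include <cstring>

// Reinterprets 8 bytes as one 64-bit word via memcpy (no aliasing or alignment issues).
inline std::uint64_t load_u64(const std::uint8_t* bytes) {
    std::uint64_t word;
    std::memcpy(&word, bytes, sizeof(word));
    return word;
}

// Hypothetical helper: counts the bits that differ between two 8-byte blocks.
inline unsigned hamming8(const std::array<std::uint8_t, 8>& a,
                         const std::array<std::uint8_t, 8>& b) {
    // XOR leaves a 1 in every position where the inputs differ.
    const std::uint64_t diff = load_u64(a.data()) ^ load_u64(b.data());
    // One 64-bit count instead of eight 8-bit counts.
    return static_cast<unsigned>(std::bitset<64>{diff}.count());
}
```

Whether count() actually compiles down to a single popcnt instruction depends on the compiler and the target flags, so check the generated assembly if that matters to you.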