c++ casting type-conversion bit-manipulation uint8t

What is the fastest way to combine 8 uint8_t into a single uintmax_t?

I am looking to optimize a piece of code that uses popcnt to compute differences between uint8_ts. I figure it'd be faster to combine 8 uint8_ts into a single uintmax_t and use popcnt64 instead so that the popcnt operation doesn't have to be called 8x more than necessary. What is the fastest way to feed 8 uint8_t into popcnt64? Can I use some kind of casting? Should I make use of bit manipulation? I'm not aware of C++'s internal workings so I'm not sure what the fastest way is to make this conversion.

Solution

Assuming you don't care about endianness – you just want to treat the uint8_ts as a uint64_t and you don't care about the order of the uint8_ts – then you can just use std::memcpy to do the type punning:

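#include <array>
#include <cstdint>
#include <cstring>
#include <type_traits>

std::uint64_t combine(std::array<std::uint8_t, 8> b) {
    // memcpy-based punning is only valid if the sizes match
    // and both types are trivially copyable.
    static_assert(sizeof(b) == sizeof(std::uint64_t));
    static_assert(std::is_trivially_copyable_v<std::uint64_t>);
    static_assert(std::is_trivially_copyable_v<decltype(b)>);

    std::uint64_t result;
    std::memcpy(&result, b.data(), sizeof(result));
    return result;
}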

With optimizations enabled, the generated x86-64 assembly just returns the argument:

combine(std::array<unsigned char, 8ul>): # @combine(std::array<unsigned char, 8ul>)
  mov rax, rdi
  ret

Other ways of type punning, such as reinterpret_cast or a union, force you to worry about strict aliasing rules and the alignment of the types involved. It's easier to just use std::memcpy and let the compiler deal with it. (In C++20, std::bit_cast is an equally safe alternative.)
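If the byte order does matter (it doesn't for popcnt, since the bit count is the same in any order), the bytes can be combined with shifts instead. The following is a minimal sketch, not part of the approach above; the name combine_le is hypothetical, and it assumes you want b[0] to end up as the least significant byte on every platform:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: packs 8 bytes into a 64-bit value so that b[0]
// becomes the least significant byte, regardless of host endianness.
constexpr std::uint64_t combine_le(const std::array<std::uint8_t, 8>& b) {
    std::uint64_t result = 0;
    for (std::size_t i = 0; i < b.size(); ++i) {
        // Widen to 64 bits before shifting so no bits are shifted out.
        result |= static_cast<std::uint64_t>(b[i]) << (8 * i);
    }
    return result;
}
```

For example, the bytes {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08} produce 0x0807060504030201. On a little-endian machine this gives the same value as the std::memcpy version; on a big-endian machine the two differ.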

Note that the easiest way of calling any variant of popcnt from C++ is to use std::bitset::count. So instead of __builtin_popcountll(my_u64) or __popcnt64(my_u64), you could just write std::bitset<64>{my_u64}.count() and you instantly get portable code.
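Putting the two pieces together might look like the sketch below. It assumes that "differences" in the question means the number of differing bits (the Hamming distance), which is not stated explicitly; the names load_u64 and bit_difference are hypothetical:

```cpp
#include <array>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterprets 8 bytes as one 64-bit word via memcpy.
inline std::uint64_t load_u64(const std::array<std::uint8_t, 8>& b) {
    std::uint64_t result;
    std::memcpy(&result, b.data(), sizeof(result));
    return result;
}

// Hypothetical helper: counts the bit positions in which a and b differ.
// Assumption: "difference" means Hamming distance.
inline std::size_t bit_difference(const std::array<std::uint8_t, 8>& a,
                                  const std::array<std::uint8_t, 8>& b) {
    // XOR leaves a 1 wherever the inputs disagree, so a single 64-bit
    // count replaces eight separate byte-wise counts.
    return std::bitset<64>{load_u64(a) ^ load_u64(b)}.count();
}
```

Because both arrays are loaded the same way, the byte order cancels out and the result is the same on any platform. If C++20 is available, std::popcount from <bit> is another portable option in place of std::bitset.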