performance language-agnostic bit-manipulation saturation-arithmetic

Fast saturating integer conversion?

I'm wondering if there is any fast bit twiddling trick to do a saturating conversion from a 64-bit unsigned value to a 32-bit unsigned value (it would be nice if it is generalized to other widths but that's the main width I care about). Most of the resources I've been able to find googling have been for saturating arithmetic operations.

A saturating conversion would take a 64-bit unsigned value, and return either the value unmodified as a 32-bit value or 2^32-1 if the input value is greater than 2^32-1. Note that this is not what the default C casting truncating behaviour does.

I can imagine doing something like:

Test if upper half has any bit set
If so create a 32-bit mask with all bits set, otherwise create a mask with all bits unset
Bitwise-or lower half with mask

But I don't know how to quickly generate the mask. I tried straightforward branching implementations in Godbolt to see if the compiler would generate a clever branchless implementation for me but no luck.

Implementation example here.

#include <stdint.h>
#include <limits.h>

// Type your code here, or load an example.
uint32_t square(uint64_t num) {
    return num > UINT32_MAX ? UINT32_MAX : num;
}

Edit: my mistake, issue was godbolt not set to use optimizations

Solution

You don't need to do any fancy bit twiddling trick to do this. The following function should be enough for compilers to generate efficient code:

uint32_t saturate(uint64_t value) {
    return value > UINT32_MAX ? UINT32_MAX : value;
}

This contains a conditional statement, but most common CPUs, like AMD/Intel and Arm ones, have conditional move instructions. So they will test the value for overflowing 32 bits, and based on the test they will replace it with UINT32_MAX, otherwise leave it alone. For example, on 64-bit Arm processors this function will be compiled by GCC (to:

saturate:
  mov x1, 4294967295
  cmp x0, x1
  csel x0, x0, x1, ls
  ret

Note that you must enable compiler optimizations to get the above result.