Say for example I have a uint8_t that can be of any value, and I only want to flip all the bits from the least significant bit up to and including the most significant 1 bit. How would I do that in the most efficient way? Is there a solution that avoids using a loop?
Here are some cases (left side is the original bits, right side is the result after the flips):
00011101 -> 00000010
00000000 -> 00000000
11111111 -> 00000000
11110111 -> 00001000
01000000 -> 00111111
[EDIT]
The type could also be larger than uint8_t: it could be uint32_t, uint64_t, or __uint128_t. I just use uint8_t because it's the easiest size to show in the example cases.
In general I expect that most solutions will have roughly this form:

1. Compute the mask of the bits that need to be flipped (everything from the least significant bit up to and including the most significant 1)
2. XOR by that mask
As mentioned in the comments, x64 is a target of interest, and on x64 you can do step 1 like this:
- Find the index p of the most significant 1, by counting leading zeroes (_lzcnt_u64) and subtracting that from 64 (or 32, whichever is appropriate).
- Create a mask of p consecutive set bits starting from the least significant bit, probably using _bzhi_u64.

There are some variations, such as using BitScanReverse to find the most significant 1 (but it has an ugly case for zero), or using a shift instead of bzhi (but it has an ugly case for 64). lzcnt and bzhi are a good combination with no ugly cases. bzhi requires BMI2 (Intel Haswell or newer, AMD Zen or newer).
Putting it together:
x ^ _bzhi_u64(~(uint64_t)0, 64 - _lzcnt_u64(x))
Which could be further simplified to
_bzhi_u64(~x, 64 - _lzcnt_u64(x))
as shown by Peter. This doesn't follow the original 2-step plan; rather, all bits are flipped, and then the bits that were originally leading zeroes are reset.
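For concreteness, here is a minimal sketch of that one-liner wrapped into a compilable function (the function name is mine, and it assumes a target with LZCNT and BMI2, e.g. compiled with -mlzcnt -mbmi2):

#include <immintrin.h>
#include <stdint.h>

// Flip all bits from the LSB up to and including the most significant set bit.
static inline uint64_t flip_up_to_msb(uint64_t x)
{
    // _lzcnt_u64(0) == 64, so the bzhi index is 0 and no bits are kept: 0 maps to 0.
    return _bzhi_u64(~x, 64 - _lzcnt_u64(x));
}

This reproduces the example cases above, e.g. 0x1D maps to 0x02 and 0x40 maps to 0x3F.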
Since those original leading zeroes form a contiguous sequence of leading ones in ~x, an alternative to bzhi could be to add the appropriate power of two to ~x (though sometimes zero, which might be thought of as 2^64, putting the set bit just beyond the top of the number). Unfortunately the power of two that we need is a bit annoying to compute; at least I could not come up with a good way to do it, so it seems like a dead end to me.
Step 1 could also be implemented in a generic way (no special operations) using a few shifts and bitwise ORs, like this:
// Get a mask of all-ones from bit 0 up to and including the leading 1
// On x86-64, this is probably slower than Paul R's method using BSR and shift
// even though you have to special-case x==0
uint64_t m = x | (x >> 1);
m |= m >> 2;
m |= m >> 4;
m |= m >> 8;
m |= m >> 16;
m |= m >> 32; // last step should be removed if x is 32-bit
AMD CPUs have slowish BSR (but fast LZCNT; https://uops.info/), so you might want this shift/or version for uint8_t or uint16_t (where it takes the fewest steps), especially if you need compatibility with all CPUs and speed on AMD is more important than on Intel.
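As a rough sketch (the function name is mine, assuming a 64-bit input), the smear above combines with the final XOR into a fully portable function like this:

#include <stdint.h>

// Portable version with no intrinsics: the shift/or chain smears the highest
// set bit down to bit 0, producing the mask of bits to flip; XOR applies it.
static inline uint64_t flip_up_to_msb_portable(uint64_t x)
{
    uint64_t m = x | (x >> 1);
    m |= m >> 2;
    m |= m >> 4;
    m |= m >> 8;
    m |= m >> 16;
    m |= m >> 32;  // drop this step for 32-bit, and earlier ones for narrower types
    return x ^ m;  // x == 0 gives m == 0, so zero needs no special case
}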
This generic version is also useful within SIMD elements, especially narrow ones, where we don't have a leading-zero-count until AVX-512.
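For example, a minimal SSE2 sketch of the per-byte case might look like this (the helper name is mine; SSE2 has no 8-bit shifts, so 16-bit shifts plus byte masks emulate them):

#include <emmintrin.h>  // SSE2

// Flip bits from the LSB up to and including the highest set bit,
// independently in each of the 16 uint8_t lanes.
static inline __m128i flip_up_to_msb_epu8(__m128i x)
{
    // Smear the highest set bit downward within each byte; the AND masks
    // clear the bits that leak in from the neighbouring byte of the 16-bit shift.
    __m128i m = _mm_or_si128(x, _mm_and_si128(_mm_srli_epi16(x, 1), _mm_set1_epi8(0x7f)));
    m = _mm_or_si128(m, _mm_and_si128(_mm_srli_epi16(m, 2), _mm_set1_epi8(0x3f)));
    m = _mm_or_si128(m, _mm_and_si128(_mm_srli_epi16(m, 4), _mm_set1_epi8(0x0f)));
    // m now has ones from bit 0 through the highest set bit of each byte.
    return _mm_xor_si128(x, m);
}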