Most of the clz()
(SW impl.) are optimized for 32 bit unsigned integer.
How to efficiently count leading zeros in a 24 bit unsigned integer?
UPD. Target's characteristics:
CHAR_BIT 24
sizeof(int) 1
sizeof(long int) 2
sizeof(long long int) 3
TL;DR: See point 4 below for the C program.
Assuming your hypothetical target machine is capable of correctly implementing unsigned 24-bit multiplication (which must return the low-order 24 bits of the product), you can use the same trick as is shown in the answer you link. (But you might not want to. See [Note 1].) It's worth trying to understand what's going on in the linked answer.
The input is reduced to a small set of values, where all integers with the same number of leading zeros map to the same value. The simple way of doing that is to flood every bit to cover all the bit positions to the right of it:
x |= x>>1;
x |= x>>2;
x |= x>>4;
x |= x>>8;
x |= x>>16;
That will work for 17 up to 32 bits; if your target datatype has 9 to 16 bits, you could leave off the last shift-and-or because there is no bit position 16 bits to the right of any bit. And so on. But with 24 bits, you'll want all five shift-and-or.
With that, you've turned x into one of 25 values (for 24-bit ints):
x clz x clz x clz x clz x clz
-------- --- -------- --- -------- --- -------- --- -------- ---
0x000000 24 0x00001f 19 0x0003ff 14 0x007fff 9 0x0fffff 4
0x000001 23 0x00003f 18 0x0007ff 13 0x00ffff 8 0x1fffff 3
0x000003 22 0x00007f 17 0x000fff 12 0x01ffff 7 0x3fffff 2
0x000007 21 0x0000ff 16 0x001fff 11 0x03ffff 6 0x7fffff 1
0x00000f 20 0x0001ff 15 0x003fff 10 0x07ffff 5 0xffffff 0
Now, to turn x into clz, we need a good hash function. We don't necessarily expect that hash(x)==clz, but we want the 25 possible x values to hash to different numbers, ideally in a small range. As with the link you provide, the hash function we'll choose is to multiply by a carefully-chosen multiplicand and then mask off a few bits. Using a mask means that we need to choose five bits; in theory, we could use a 5-bit mask anywhere in the 24-bit word, but in order to not have to think too much, I just chose the five high-order bits, the same as the 32-bit solution. Unlike the 32-bit solution, I didn't bother adding 1, and I expect to distinct values for all 25 possible inputs. The equivalent isn't possible with a five-bit mask and 33 possible clz values (as in the 32-bit case), so they have to jump through an additional hoop if the original input was 0.
Since the hash function doesn't directly produce the clz value, but rather a number between 0 and 31, we need to translate the result to a clz value, which uses a 32-byte lookup table, called debruijn
in the 32-bit algorithm for reasons I'm not going to get into.
An interesting question is how to select a multiplier with the desired characteristics. One possibility would be to do a bunch of number theory to elegantly discover a solution. That's how it was done decades ago, but these days I can just write a quick-and-dirty Python program to do a brute force search over all the possible multipliers. After all, in the 24-bit case there are only about 16 million possibilities and lots of them work. The actual Python code I used is:
# Compute the 25 target values
targ=[2**i - 1 for i in range(25)]
# For each possible multiplier, compute all 25 hashes, and see if they
# are all different (that is, the set of results has size 25):
next(i for i in range(2**19, 2**24)
if len(targ)==len(set(((i * t) >> 19) & 0x1f
for t in targ)))
Calling next
on a generator expression returns the first generated value, which in this case is 0x8CB4F, or 576335. Since the search starts at 0x80000 (which is the smallest multiplier for which hash(1) is not 0), the result printed instantly. I then spent a few more milliseconds to generate all the possible multipliers between 219 and 220, of which there are 90, and selected 0xCAE8F (831119) for purely personal aesthetic reasons.
The last step is to create the lookup table from the computed hash function. (Not saying this is good Python. I just took it from my command history; I might come back and clean it up later. But I included it for completeness.):
lut = dict((i,-1) for i in range(32))
lut.update((((v * 0xcae8f) >> 19) & 0x1f, 24 - i)
for i, v in enumerate(targ))
print(" static const char lut[] = {\n " +
",\n ".join(', '.join(f"{lut[i]:2}" for i in range(j, j+8))
for j in range(0, 32, 8)) +
"\n };\n")
# The result is pasted into the C code below.
So then it's just a question of assembling the C code:
// Assumes that `unsigned int` has 24 value bits.
int clz(unsigned x) {
static const char lut[] = {
24, 23, 7, 18, 22, 6, -1, 9,
-1, 17, 15, 21, 13, 5, 1, -1,
8, 19, 10, -1, 16, 14, 2, 20,
11, -1, 3, 12, 4, -1, 0, -1
};
x |= x>>1;
x |= x>>2;
x |= x>>4;
x |= x>>8;
x |= x>>16;
return lut[((x * 0xcae8f) >> 19) & 0x1f];
}
The test code calls clz
on every 24-bit integer in turn. Since I don't have a 24-bit machine handy, I just assume that the arithmetic will work the same on the hypothetical 24-bit machine in the OP.
#include <stdio.h>
# For each 24-bit integer in turn (from 0 to 2**24-1), if
# clz(i) is different from clz(i-1), print clz(i) and i.
#
# Expected output is 0 and the powers of 2 up to 2**23, with
# descending clz values from 24 to 0.
int main(void) {
int prev = -1;
for (unsigned i = 0; i < 1<<24; ++i) {
int pfxlen = clz(i);
if (pfxlen != prev) {
printf("%2d 0x%06X\n", pfxlen, i);
prev = pfxlen;
}
}
return 0;
}
If the target machine does not implement 24-bit unsigned multiply in hardware --i.e., it depends on a software emulation-- then it's almost certainly faster to do the clz by just looping over initial bits, particularly if you fold the loop by scanning several bits at a time with a lookup table. That might be faster even if the machine does do efficient hardware multiplies. For example, you can scan six bits at a time with a 32-entry table:
// Assumes that `unsigned int` has 24 value bits.
int clz(unsigned int x) {
static const char lut[] = {
5, 4, 3, 3, 2, 2, 2, 2,
1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0
};
/* Six bits at a time makes octal easier */
if (x & 077000000u) return lut[x >> 19];
if (x & 0770000u) return lut[x >> 13] + 6;
if (x & 07700u) return lut[x >> 7] + 12;
if (x ) return lut[x >> 1] + 18;
return 24;
}
That table could be reduced to 48 bits but the extra code would likely eat up the savings.
A couple of clarifications seem to be in order here. First, although we're scanning six bits at a time, we only use five of them to index the table. That's because we've previously verified that the six bits in question are not all zero; in that case, the low-order bit is either irrelevant (if some other bit is set) or it's 1. Also, we get the table index by shifting without masking; the masking is unnecessary because we know from the masked tests that all the higher order bits are 0. (This will, however, fail miserably if x
has more than 24 bits.)