I have a program in which it is necessary to calculate the floor of the log-base-2 of an integer very frequently. As a reuslt, the performance of the standard library's log2 method in my language of choice (floor(std::log2([INT]))
from <cmath>
in C++) is unsatisfactory, and I would like to implement the quickest version of this algorithm possible. I have found versions online which use bitwise operators to calculate this value for 32-bit and 64-bit integers:
INT Y(log2i)(const INT m)
{
/* Special case, zero or negative input. */
if (m <= 0)
return -1;
#if SIZEOF_PTRDIFF_T == 8
/* Hash map with return values based on De Bruijn sequence. */
static INT debruijn[64] =
{
0, 58, 1, 59, 47, 53, 2, 60, 39, 48, 27, 54, 33, 42, 3, 61, 51, 37, 40, 49,
18, 28, 20, 55, 30, 34, 11, 43, 14, 22, 4, 62, 57, 46, 52, 38, 26, 32, 41,
50, 36, 17, 19, 29, 10, 13, 21, 56, 45, 25, 31, 35, 16, 9, 12, 44, 24, 15,
8, 23, 7, 6, 5, 63
};
register uint64_t v = (uint64_t)(m);
/* Round down to one less than a power of 2. */
v |= v >> 1;
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
v |= v >> 32;
/* 0x03f6eaf2cd271461 is a hexadecimal representation of a De Bruijn
* sequence for binary words of length 6. The binary representation
* starts with 000000111111. This is required to make it work with one less
* than a power of 2 instead of an actual power of 2.
*/
return debruijn[(uint64_t)(v * 0x03f6eaf2cd271461LU) >> 58];
#elif SIZEOF_PTRDIFF_T == 4
/* Hash map with return values based on De Bruijn sequence. */
static INT debruijn[32] =
{
0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28,
15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
};
register uint32_t v = (uint32_t)(m);
/* Round down to one less than a power of 2. */
v |= v >> 1;
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
/* 0x07C4ACDD is a hexadecimal representation of a De Bruijn sequence for
* binary words of length 5. The binary representation starts with
* 0000011111. This is required to make it work with one less than a power of
* 2 instead of an actual power of 2.
*/
return debruijn[(uint32_t)(v * 0x07C4ACDDU) >> 27];
#else
#error Incompatible size of ptrdiff_t
#endif
}
(Above code taken from this link; the comments of said code reference this page, which gives a brief overview of how the algorithm works).
I need to implement a version of this algorithm for 256-bit integers. The general form, for an n-bit integer, is fairly easy to understand: (1) create an array of integers from the Debruijn sequence with n entries; (2) perform in-place bitwise-or on the integer in question right-shifted by 2^i for i=1...(n/2); and (3) return some the entry of the Debruijn array with an index equal to the integer multiplied by a constant right-shifted by another constant.
The third step is where I'm confused. How exactly does one derive 0x07C4ACDDU
and 0x03f6eaf2cd271461LU
as the multiplication constants for 32 and 64 bits, respectively? How does one derive 58
and 27
as the constants by which one should right-shift? What would these values be for a 256-bit integer in particular?
Thanks in advance. Sorry if this is obvious, I'm not very educated in math.
I agree with harold that std::countl_zero()
is the way to go. Memory
has gotten a lot slower relative to compute since this bit-twiddling
hack was designed, and processors typically have built-in instructions.
To answer your question, however, this hack combines a couple primitives. (When I talk about bit indexes, I'm counting from most to least significant, starting the count at zero.)
The sequence of lines starting with v |= v >> 1;
achieves its
stated goal of rounding up to the nearest power of two minus one
(i.e., a number whose binary representation matches 0*1*
) by
setting every bit to the right of at least one set bit.
v
.i
, we observe that a bit at
position i + delta
will be guaranteed set by the lines
matching delta
's binary representation, e.g., delta = 13
(1101 in binary) is handled by
v |= v >> 1; v |= v >> 4; v |= v >> 8;
.Extracting bits [L, L+delta)
from an unsigned integer n
with
WIDTH
bits can be accomplished with (n << L) >> (WIDTH - delta)
.
The left shift truncates the upper bits that should be discarded,
and the right shift, which is logical in C++ for unsigned, truncates
the lower bits and fills the truncated upper bits with zeros.
Given that the answer is k
, we want to extract 5 (= log2(32), for
32-bit) or 6 (= log2(64), for 64-bit) bits starting with index k
from the magic constant n
. We can't shift by k
because we only
know pow2(k)
(sort of, I'll get to that in a second), but we can
use the equivalence between multiplying by pow2(k)
and left
shifting by k
as a workaround.
Actually we only know pow2(k+1) - 1
. We're going to be greedy and
shave the two ops that we'd need to get pow2(k)
. By putting 5 or 6
ones after the initial zeros, we force that minus one to always
cause the answer to be one less than it should have been (mod 32 or
64).
So the de Bruijn sequence: the idea is that we can uniquely identify our index in the sequence by reading the next 5 or 6 bits. We aren't so lucky as to be able to have these bits be equal to the index, which is where the look-up table comes in.
As an example, I'll adapt this algorithm to 8-bit words. We start with
v |= v >> 1;
v |= v >> 2;
v |= v >> 4;
The de Bruijn sequence is 00011101
, which written out in three-bit
segments is
(index − 1) mod 8 | bits | value | (value − 1) mod 8 |
---|---|---|---|
7 | 000 | 0 | 7 |
0 | 001 | 1 | 0 |
1 | 011 | 3 | 2 |
2 | 111 | 7 | 6 |
3 | 110 | 6 | 5 |
4 | 101 | 5 | 4 |
5 | 010 | 2 | 1 |
6 | 100 | 4 | 3 |
The hex constant is 0x1D
, the right shift is 8 − log2(8) = 5, the
table is derived by inverting the permutation above:
{0, 5, 1, 6, 4, 3, 2, 7}
.
So, hypothetically, if we wanted to adapt this algorithm to a 256-bit
word size, we'd add v |= v >> 64; v |= v >> 128;
, change the shift to
256 − log2(256) = 256 − 8 = 248, find a 256-bit de Bruijn sequence that
starts with 0000000011111111
, encode it as a hex constant, and
construct the appropriate look-up table to pair with it.
But like, don't. If you insist on not using the library function, you're still probably on a 64-bit machine, so you should just test whether each of the four words from big to little is nonzero, and if so, apply the 64-bit code and add the appropriate offset.