I am trying to optimize a small, heavily used function that uses the set bits of an unsigned short int to select which elements of an array to sum. At first I used the obvious approach shown below. Note that loop unrolling is not shown explicitly, since the compiler should handle it.
int total = 0;
for (unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++) {
    if (i & mask) {
        total += value[j];
    }
}
However, later I thought it might be better to remove the branching to help CPU pipelining and came up with the following.
int total = 0;
for (unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++) {
    total += ((i & mask) != 0) * value[j];
}
Note that since (i & mask) is not a boolean (it is either 0 or the value of mask), the comparison with 0 forces the result to be either 1 or 0. Although this second approach eliminates the if-statement, it has to perform a multiplication by 0 or 1 on every iteration on top of the addition.
Which code will run faster?
You could make it branchless without a multiply. It looks like for each bit set you are using that bit position as an index into an array.
First, you can easily isolate the lowest set bit and then clear it with:
unsigned short set_mask = i & -i;  /* isolate the lowest set bit */
i &= i - 1;                        /* clear that bit */
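To make the two lines above concrete, here is a minimal sketch that repeats them until no bits remain and records each isolated bit. The function name `isolate_set_bits` and the `out` buffer are hypothetical, introduced only for illustration. The sketch does the arithmetic in `unsigned int` and writes the negation as `0u - bits`, so it does not depend on how a promoted `unsigned short` behaves under unary minus.

```c
/* Hypothetical helper: writes each set bit of x, lowest first, into out[]
 * as a one-bit mask and returns how many bits were set.
 * out must have room for 16 entries. */
static int isolate_set_bits(unsigned short x, unsigned short out[16])
{
    unsigned int bits = x;   /* work in unsigned int to keep the arithmetic unsigned */
    int n = 0;

    while (bits != 0) {
        unsigned int set_mask = bits & (0u - bits);  /* isolate the lowest set bit */
        out[n++] = (unsigned short)set_mask;
        bits &= bits - 1u;                           /* clear the lowest set bit */
    }
    return n;
}
```

The loop runs once per set bit rather than once per bit position, so a sparse mask does less work than the 16-iteration loops in the question.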
Then, you can get the bit index by counting the bits set in (set_mask - 1). There's a constant-time formula for this.
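As a rough sketch of what that could look like, the code below uses the well-known parallel (SWAR) bit count, narrowed to 16 bits, and feeds it `set_mask - 1` to obtain the array index. The names `popcount16` and `sum_selected_popcount` are hypothetical, and a 16-entry `value` array of `int` is assumed; this is one possible constant-time formula, not necessarily the one the answer had in mind.

```c
/* Hypothetical helper: counts the set bits in the low 16 bits of x
 * with a fixed number of operations (no loop, no branch). */
static unsigned popcount16(unsigned x)
{
    x &= 0xFFFFu;
    x = x - ((x >> 1) & 0x5555u);              /* 2-bit partial counts */
    x = (x & 0x3333u) + ((x >> 2) & 0x3333u);  /* 4-bit partial counts */
    x = (x + (x >> 4)) & 0x0F0Fu;              /* 8-bit partial counts */
    return (x + (x >> 8)) & 0x001Fu;           /* total, 0..16 */
}

/* Sums value[j] for every bit j that is set in i.
 * Assumes value has at least 16 elements. */
static int sum_selected_popcount(unsigned short i, const int value[16])
{
    unsigned bits = i;
    int total = 0;

    while (bits != 0) {
        unsigned set_mask = bits & (0u - bits);     /* lowest set bit */
        /* set_mask - 1 has exactly as many 1 bits as the bit's index */
        total += value[popcount16(set_mask - 1u)];
        bits &= bits - 1u;                          /* clear that bit */
    }
    return total;
}
```

For example, with `set_mask == 0x0008` the expression `set_mask - 1` is `0x0007`, which has three bits set, so the index is 3.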
Some platforms also have an intrinsic to get the bit index of a set bit, which is probably faster: x86 has bsr (or bsf; with only one bit set, both give the same index) and PPC has cntlz.
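For instance, GCC and Clang expose a count-trailing-zeros builtin, `__builtin_ctz`, which compilers typically lower to a single instruction where the hardware has one. The sketch below assumes one of those compilers and a 16-entry `value` array of `int`; the function name `sum_selected_ctz` is hypothetical. Other compilers have their own equivalents, which are not shown here.

```c
/* Sums value[j] for every bit j that is set in i.
 * Assumes GCC or Clang (__builtin_ctz) and that value has
 * at least 16 elements. */
static int sum_selected_ctz(unsigned short i, const int value[16])
{
    unsigned bits = i;
    int total = 0;

    /* __builtin_ctz is undefined for 0, so the loop test guards it */
    while (bits != 0) {
        total += value[__builtin_ctz(bits)];  /* index of the lowest set bit */
        bits &= bits - 1u;                    /* clear that bit */
    }
    return total;
}
```

Because the builtin returns the index of the lowest set bit directly, the separate `set_mask` step is not needed in this variant.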
So the answer is that the branchless, multiply-free version is probably fastest :)