Is it more efficient to branch or multiply?

I am trying to optimize a small, heavily used function that uses the set bits of an unsigned short int to select which elements of an array to sum. At first I used the obvious approach shown below. Loop unrolling is not shown explicitly, since the compiler should take care of it.

int total = 0;
for(unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++){
    if (i & mask){
        total += value[j];
    }
}

However, later I thought it might be better to remove the branching to help CPU pipelining and came up with the following.

int total = 0;
for(unsigned short mask = 0x0001, j = 0; mask != 0; mask <<= 1, j++){
    total += ((i & mask) != 0) * value[j];
}

Note that (i & mask) evaluates to either 0 or the mask value rather than to 0 or 1, so the comparison with 0 is what forces the result to be exactly 1 or 0. This second approach removes the if-statement from the loop body, but in exchange it performs a multiplication by 0 or 1 and an addition on every iteration, even for bits that are not set.

Which code will run faster?

Solution

You could make it branchless without a multiply. It looks like for each bit set you are using that bit position as an index into an array.

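First, you can isolate the lowest set bit of i and then clear it, so that repeating the two steps visits each set bit in turn:

unsigned short set_mask = i & -i;  /* keeps only the lowest set bit of i */
i &= i - 1;                        /* clears that bit from i */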
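
As a rough sketch of how those two statements behave when repeated until i is zero, the helper below collects each isolated mask in turn. The name isolate_set_bits and the out array are made up for illustration and are not part of the original code; the i & -i step also assumes a two's complement machine, which is the usual case.

```c
#include <stdint.h>

/* Hypothetical helper: stores each isolated single-bit mask of i in out[],
   lowest bit first, and returns how many bits were set. */
int isolate_set_bits(uint16_t i, uint16_t out[16])
{
    int n = 0;
    while (i != 0) {
        /* i & -i keeps only the lowest set bit (two's complement assumed). */
        uint16_t set_mask = (uint16_t)(i & -i);
        out[n++] = set_mask;
        /* i & (i - 1) clears that same bit, so the loop moves to the next one. */
        i &= (uint16_t)(i - 1);
    }
    return n;
}
```

The loop runs once per set bit rather than once per bit position, so sparse values of i finish early.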

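Then, you can get the bit index by counting the bits set in (set_mask - 1). Because set_mask has exactly one bit set, at position j, subtracting 1 leaves exactly the j bits below it set, so the population count of (set_mask - 1) equals j. There is a constant-time formula for the population count.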
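
One common form of that constant-time count is the parallel-sum trick sketched below, narrowed here to 16 bits. The function names popcount16 and bit_index are assumptions made for this example, and bit_index expects a mask with exactly one bit set.

```c
#include <stdint.h>

/* Counts the set bits of a 16-bit value with a fixed number of operations. */
unsigned popcount16(uint16_t x)
{
    unsigned v = x;
    v = v - ((v >> 1) & 0x5555u);              /* each 2-bit field holds its own bit count */
    v = (v & 0x3333u) + ((v >> 2) & 0x3333u);  /* each 4-bit field holds its count */
    v = (v + (v >> 4)) & 0x0F0Fu;              /* each byte holds its count */
    return (v + (v >> 8)) & 0x1Fu;             /* add the two bytes: result is 0..16 */
}

/* Index of the single bit set in set_mask (set_mask must have exactly one bit set). */
unsigned bit_index(uint16_t set_mask)
{
    return popcount16((uint16_t)(set_mask - 1));
}
```

For example, a set_mask of 0x0008 gives set_mask - 1 = 0x0007, which has three bits set, so the index is 3.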

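Some platforms also have an instruction or intrinsic that returns the index of a set bit directly, which is probably faster. x86 has bsf and bsr (bit scan forward and reverse), and PowerPC has cntlzw (count leading zeros). Since set_mask has only one bit set, scanning from either end finds the same bit.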
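
Putting the pieces together, here is a minimal sketch that assumes a GCC or Clang compiler, where the __builtin_ctz builtin (count trailing zeros) typically compiles to a bit-scan instruction. The function name sum_selected and the 16-entry value array are assumptions based on the question; other compilers take the plain fallback loop.

```c
#include <stdint.h>

/* Sums value[j] for every bit j that is set in i. Assumes value has 16 entries. */
int sum_selected(uint16_t i, const int value[16])
{
    int total = 0;
    unsigned bits = i;
    while (bits != 0) {
#if defined(__GNUC__) || defined(__clang__)
        /* Index of the lowest set bit; only called with bits != 0,
           because __builtin_ctz is undefined for a zero argument. */
        unsigned j = (unsigned)__builtin_ctz(bits);
#else
        /* Portable fallback: scan upward for the lowest set bit. */
        unsigned j = 0;
        while (((bits >> j) & 1u) == 0) {
            j++;
        }
#endif
        total += value[j];
        bits &= bits - 1;   /* clear the bit just handled */
    }
    return total;
}
```

Note that this version still has the loop-exit test, and the iteration count now depends on how many bits are set, so it is worth timing against the two loops from the question on real inputs.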

So the answer is that a version with neither the per-bit branch nor the multiply is probably fastest :)