Using a bitset to get a conditional sum of array items without using loop

There are a bitset b and an array arr. I would like to get the sum of the array by using the bitset, so that arr[i] is summed only if b[i] is set.

Example:

bitset<4> b("1001"); 
arr[4] = {5,4,3,1};

===> the sum would be 1+5 = 6.

Obviously, I could use a loop. But this kind of computation is repeated multiple times. Is there a more effective way without a loop ?

Solution

The fastest way would probably to iterate through a loop and add depending if the current bit is set.

Perhaps the use of bit shifting instead of accessing directly to bit n could avoid some hidden gymnastics by the standard library to extract a specific bit, but this is yet to be proven by benchmarking.:

int sum=0;
for (size_t i=b.size(); i; i--, b>>=1) 
    if (b[0])
        sum+=arr[i-1];

For the completeness and the fun, if you could live with a vector<bool> instead of a bitset<>, you could just use the standard library:

vector<bool> b{1,0,0,1}; 
int arr[4] = {5,4,3,1};
int test=inner_product(begin(arr), end(arr), b.begin(), 0);

Online demo

Edit: additional infos - trust your optimizing compiler

I experimented on godbolt online compiler with gcc 7.2 for x86-64 platform to look at the assembly produced by the different variants above.

The simplest solution, iterating trough the loop and conditionally adding the items is extremely fast:

With constant values as in your example, the compiler is able via constant propagation and determine the result of 6 immediately (see here).
If you have small size data structures, the compiler will automatically unroll the loop, making it lightening fast (14 sequential assembler instruction for 4 elements), without any loop index to maintain.
However the optimizer may choose not to unroll the loop for larger sizes. In my experiment, up to up to a size of 17 items the loop is unrolled, with 18 elements it's a real loop)

The bit shifting alternative is very close:

Up to 17 items, the loop is unrolled and the code generated is exacly the same as with the simple loop.
From 18 elements onwards, a loop is generated, which has one assembly instruction less per iteration. It seems a little fastern but only a benchmark could really make the difference which might be in the nanosecond range.

Of course, the inner_product is much heavier, because it really makes a multiplication which require to convert the bools to ints, and it has to cope with a dynamic size for the bool vector.