Search code examples
c++arraysalgorithmsumbitset

Using a bitset to get a conditional sum of array items without using loop


There are a bitset b and an array arr. I would like to get the sum of the array by using the bitset, so that arr[i] is summed only if b[i] is set.

Example:

bitset<4> b("1001"); 
arr[4] = {5,4,3,1};

===> the sum would be 1+5 = 6. 

Obviously, I could use a loop. But this kind of computation is repeated multiple times. Is there a more effective way without a loop ?


Solution

  • The fastest way would probably to iterate through a loop and add depending if the current bit is set.

    Perhaps the use of bit shifting instead of accessing directly to bit n could avoid some hidden gymnastics by the standard library to extract a specific bit, but this is yet to be proven by benchmarking.:

    int sum=0;
    for (size_t i=b.size(); i; i--, b>>=1) 
        if (b[0])
            sum+=arr[i-1];
    

    For the completeness and the fun, if you could live with a vector<bool> instead of a bitset<>, you could just use the standard library:

    vector<bool> b{1,0,0,1}; 
    int arr[4] = {5,4,3,1};
    int test=inner_product(begin(arr), end(arr), b.begin(), 0); 
    

    Online demo


    Edit: additional infos - trust your optimizing compiler

    I experimented on godbolt online compiler with gcc 7.2 for x86-64 platform to look at the assembly produced by the different variants above.

    The simplest solution, iterating trough the loop and conditionally adding the items is extremely fast:

    • With constant values as in your example, the compiler is able via constant propagation and determine the result of 6 immediately (see here).
    • If you have small size data structures, the compiler will automatically unroll the loop, making it lightening fast (14 sequential assembler instruction for 4 elements), without any loop index to maintain.
    • However the optimizer may choose not to unroll the loop for larger sizes. In my experiment, up to up to a size of 17 items the loop is unrolled, with 18 elements it's a real loop)

    The bit shifting alternative is very close:

    • Up to 17 items, the loop is unrolled and the code generated is exacly the same as with the simple loop.
    • From 18 elements onwards, a loop is generated, which has one assembly instruction less per iteration. It seems a little fastern but only a benchmark could really make the difference which might be in the nanosecond range.

    Of course, the inner_product is much heavier, because it really makes a multiplication which require to convert the bools to ints, and it has to cope with a dynamic size for the bool vector.