And again about the STL std::bitset: its documentation says that the member functions set/reset/test do bounds checks, while operator[] doesn't. Yet my timing experiments show that set/test typically perform 2-3% faster than operator[]. The code I'm working with is:
typedef unsigned long long U64;
const U64 MAX = 800000000ULL;

struct Bitmap1
{
    void insert(U64 N) { this->s[N % MAX] = 1; }
    bool find(U64 N) const { return this->s[N % MAX]; }
private:
    std::bitset<MAX> s; // <---- takes MAX/8 memory (in bytes)
};

struct Bitmap2
{
    void insert(U64 N) { this->s.set(N % MAX); }
    bool find(U64 N) const { return this->s.test(N % MAX); }
private:
    std::bitset<MAX> s; // <---- takes MAX/8 memory (in bytes)
};
int main()
{
    Bitmap2* s = new Bitmap2();

    // --------------------------- storing
    const size_t t0 = time(0);
    for (unsigned k = 0; k < LOOPS; ++k)
    {
        for (U64 i = 0; i < MAX; ++i) s->insert(i);
    }
    cout << "storing: " << time(0) - t0 << endl;

    // --------------------------- search
    const size_t t1 = time(0);
    U64 count = 0;
    for (unsigned k = 0; k < LOOPS; ++k)
    {
        for (U64 i = 0; i < MAX; ++i) if (s->find(i)) ++count;
    }
    cout << "search: " << time(0) - t1 << endl;
    cout << count << endl;
}
How can this be explained? The absence of bounds checks should save us some cycles, right?

Compiler: g++ 4.8.1 (options -g -O4)
VMware VM: Ubuntu 3.11.0-15
Host: MacBook Pro

When I remove rand, division, output, and the memory cache from the timings:
bool bracket_test() {
    std::bitset<MAX> s;
    for (int j = 0; j < num_iterations; ++j) {
        for (int i = 0; i < MAX; ++i)
            s[i] = !s[MAX-1-i];
    }
    return s[0];
}

bool set_test() {
    std::bitset<MAX> s;
    for (int j = 0; j < num_iterations; ++j) {
        for (int i = 0; i < MAX; ++i)
            s.set(i, !s.test(MAX-1-i));
    }
    return s.test(0);
}

bool no_test() {
    bool s = false;
    for (int j = 0; j < num_iterations; ++j) {
        for (int i = 0; i < MAX; ++i)
            s = !s;
    }
    return s;
}
I get these results with Clang at http://coliru.stacked-crooked.com/a/cdc832bfcc7e32be. (I do 10000 iterations, 20 times, and measure the lowest time, which mitigates timing errors.)
clang++ -std=c++11 -O0 -Wall -Wextra -pedantic -pthread main.cpp && ./a.out
bracket_test took 178663845 ticks to find result 1
set_test took 117336632 ticks to find result 1
no_test took 9214297 ticks to find result 0
clang++ -std=c++11 -O1 -Wall -Wextra -pedantic -pthread main.cpp && ./a.out
bracket_test took 798184780 ticks to find result 1
set_test took 565999680 ticks to find result 1
no_test took 41693575 ticks to find result 0
clang++ -std=c++11 -O2 -Wall -Wextra -pedantic -pthread main.cpp && ./a.out
bracket_test took 81240369 ticks to find result 1
set_test took 72172912 ticks to find result 1
no_test took 41907685 ticks to find result 0
clang++ -std=c++11 -O3 -Wall -Wextra -pedantic -pthread main.cpp && ./a.out
bracket_test took 77688054 ticks to find result 1
set_test took 72433185 ticks to find result 1
no_test took 41433010 ticks to find result 0
Previous versions of this test found that brackets were slightly faster, but now that I've improved the accuracy of the timings, it appears that my margin of error is approximately 3%. At O1 set is 35-54% faster, at O2 it's 13-49% faster, and at O3 it's 2-34% faster. This seems pretty conclusive to me, short of examining the assembly output.
So here's the assembly (at GCC -O) via http://assembly.ynh.io/ for:

std::bitset<MAX> s;
s[1000000] = true;
return s;
0000 4889F8           movq %rdi, %rax
0003 4889FA           movq %rdi, %rdx
0006 488D8F00E1F505   leaq 100000000(%rdi), %rcx
.L2:
000d 48C70200000000   movq $0, (%rdx)
0014 4883C208         addq $8, %rdx
0018 4839CA           cmpq %rcx, %rdx
001b 75F0             jne .L2
001d 48838848E8010001 orq $1, 125000(%rax)
0025 C3               ret
and
std::bitset<MAX> s;
s.set(1000000);
return s;
0026 4889F8           movq %rdi, %rax
0029 4889FA           movq %rdi, %rdx
002c 488D8F00E1F505   leaq 100000000(%rdi), %rcx
.L6:
0033 48C70200000000   movq $0, (%rdx)
003a 4883C208         addq $8, %rdx
003e 4839CA           cmpq %rcx, %rdx
0041 75F0             jne .L6
0043 48838848E8010001 orq $1, 125000(%rax)
004b C3               ret
I can't read assembly all that well, but these two listings are completely identical, so analysis of this case is easy: if the compiler knows both indexes are in range, it optimizes out the range check. When I replace the fixed index with a variable index, set adds 5 operations to check for the boundary case.

As for the REASONS that set is sometimes faster: operator[] has to do a TON of work for the reference proxy that set doesn't have to do. The reason that set is sometimes slower is that the proxy is trivially inlined, in which case the only difference is that set has to do the bounds check. On the other hand, set only has to do the bounds check if the compiler cannot prove that indexes are always in range. So it depends a lot on the surrounding code. Your results may differ.
http://en.cppreference.com/w/cpp/utility/bitset/set says:
Sets the bit at position pos to the value value.
Throws std::out_of_range if pos does not correspond to a valid position within the bitset.
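That documented bounds check is directly observable. A quick sketch (the helper name is mine, not from the original post) showing set() throwing where operator[] on the same index would be undefined behavior:

```cpp
#include <bitset>
#include <cstddef>
#include <stdexcept>

// Returns true if set(pos) performed a bounds check and threw.
// For an 8-bit bitset, any pos >= 8 must raise std::out_of_range;
// b[pos] with the same out-of-range pos would be undefined behavior.
bool set_throws_out_of_range(std::size_t pos) {
    std::bitset<8> b;
    try {
        b.set(pos);    // checked access
        return false;  // no throw: pos was in range
    } catch (const std::out_of_range&) {
        return true;   // the bounds check fired
    }
}
```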
http://en.cppreference.com/w/cpp/utility/bitset/operator_at says:
Accesses the bit at position pos. Returns an object of type std::bitset::reference that allows modification of the value.
Unlike test(), does not throw exceptions: the behavior is undefined if pos is out of bounds.
and http://en.cppreference.com/w/cpp/utility/bitset/reference says:
The std::bitset class includes std::bitset::reference as a publicly-accessible nested class. This class is used as a proxy object to allow users to interact with individual bits of a bitset, since standard C++ types (like references and pointers) are not built with enough precision to specify individual bits. The primary use of std::bitset::reference is to provide an l-value that can be returned from operator[]. Any reads or writes to a bitset that happen via a std::bitset::reference potentially read or write to the entire underlying bitset.
It should be clear that operator[] actually involves a lot more than is intuitive.
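To make the proxy concrete, here's a small sketch: operator[] on a non-const bitset returns a std::bitset::reference object, not a bool, and writes through it are read-modify-write operations on the underlying storage word.

```cpp
#include <bitset>
#include <type_traits>

// Demonstrates that operator[] hands back a proxy object rather
// than a plain bool, and that the proxy supports assignment and
// operator~ as described in the cppreference quote above.
bool proxy_demo() {
    std::bitset<64> b;
    auto r = b[5];  // r is std::bitset<64>::reference, not bool
    static_assert(!std::is_same<decltype(r), bool>::value,
                  "operator[] on a non-const bitset returns a proxy");
    r = true;            // proxy write: masks and ORs a storage word
    bool flipped = ~r;   // proxy operator~ yields the inverted bit value
    return b.test(5) && flipped == false;
}
```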