And again about the STL std::bitset: its documentation says that the member functions set/reset/test do bounds checks, while operator[] doesn't. Yet my timing experiments show that set/test typically perform 2-3% faster than operator[]. The code I'm working with is:
typedef unsigned long long U64;
const U64 MAX = 800000000ULL;

struct Bitmap1
{
    void insert(U64 N) { this->s[N % MAX] = 1; }
    bool find(U64 N) const { return this->s[N % MAX]; }
private:
    std::bitset<MAX> s; // <---- takes MAX/8 memory (in bytes)
};

struct Bitmap2
{
    void insert(U64 N) { this->s.set(N % MAX); }
    bool find(U64 N) const { return this->s.test(N % MAX); }
private:
    std::bitset<MAX> s; // <---- takes MAX/8 memory (in bytes)
};
int main()
{
    Bitmap2* s = new Bitmap2();

    // --------------------------- storing
    const size_t t0 = time(0);
    for (unsigned k = 0; k < LOOPS; ++k)
    {
        for (U64 i = 0; i < MAX; ++i) s->insert(i);
    }
    cout << "storing: " << time(0) - t0 << endl;

    // --------------------------- search
    const size_t t1 = time(0);
    U64 count = 0;
    for (unsigned k = 0; k < LOOPS; ++k)
    {
        for (U64 i = 0; i < MAX; ++i) if (s->find(i)) ++count;
    }
    cout << "search: " << time(0) - t1 << endl;
    cout << count << endl;
}
How can this be explained? The absence of bounds checks should save us some cycles, right?

Compiler: g++ 4.8.1 (options -g -O4)
VMware VM: Ubuntu 3.11.0-15
Host: MacBook Pro

When I remove rand, division, output, and the memory cache from the timings:
bool bracket_test() {
    std::bitset<MAX> s;
    for (int j = 0; j < num_iterations; ++j) {
        for (int i = 0; i < MAX; ++i)
            s[i] = !s[MAX-1-i];
    }
    return s[0];
}

bool set_test() {
    std::bitset<MAX> s;
    for (int j = 0; j < num_iterations; ++j) {
        for (int i = 0; i < MAX; ++i)
            s.set(i, !s.test(MAX-1-i));
    }
    return s.test(0);
}

bool no_test() {
    bool s = false;
    for (int j = 0; j < num_iterations; ++j) {
        for (int i = 0; i < MAX; ++i)
            s = !s;
    }
    return s;
}
I get these results with Clang at http://coliru.stacked-crooked.com/a/cdc832bfcc7e32be. (I do 10000 iterations, 20 times, and measure the lowest time, which mitigates timing errors.)
clang++ -std=c++11 -O0 -Wall -Wextra -pedantic -pthread main.cpp && ./a.out
bracket_test took 178663845 ticks to find result 1
set_test took 117336632 ticks to find result 1
no_test took 9214297 ticks to find result 0
clang++ -std=c++11 -O1 -Wall -Wextra -pedantic -pthread main.cpp && ./a.out
bracket_test took 798184780 ticks to find result 1
set_test took 565999680 ticks to find result 1
no_test took 41693575 ticks to find result 0
clang++ -std=c++11 -O2 -Wall -Wextra -pedantic -pthread main.cpp && ./a.out
bracket_test took 81240369 ticks to find result 1
set_test took 72172912 ticks to find result 1
no_test took 41907685 ticks to find result 0
clang++ -std=c++11 -O3 -Wall -Wextra -pedantic -pthread main.cpp && ./a.out
bracket_test took 77688054 ticks to find result 1
set_test took 72433185 ticks to find result 1
no_test took 41433010 ticks to find result 0
Previous versions of this test found that brackets were slightly faster, but now that I've improved the accuracy of the timings, it appears that my margin of error is approximately 3%. At O1 set is 35-54% faster, at O2 it's 13-49% faster, and at O3 it's 2-34% faster. This seems pretty conclusive to me, short of examining the assembly output.
So here's the assembly (at GCC -O) via http://assembly.ynh.io/ for:

std::bitset<MAX> s;
s[1000000] = true;
return s;
0000 4889F8           movq %rdi, %rax
0003 4889FA           movq %rdi, %rdx
0006 488D8F00E1F505   leaq 100000000(%rdi), %rcx
.L2:
000d 48C70200000000   movq $0, (%rdx)
0014 4883C208         addq $8, %rdx
0018 4839CA           cmpq %rcx, %rdx
001b 75F0             jne .L2
001d 48838848E8010001 orq $1, 125000(%rax)
0025 C3               ret
and
std::bitset<MAX> s;
s.set(1000000);
return s;
0026 4889F8           movq %rdi, %rax
0029 4889FA           movq %rdi, %rdx
002c 488D8F00E1F505   leaq 100000000(%rdi), %rcx
.L6:
0033 48C70200000000   movq $0, (%rdx)
003a 4883C208         addq $8, %rdx
003e 4839CA           cmpq %rcx, %rdx
0041 75F0             jne .L6
0043 48838848E8010001 orq $1, 125000(%rax)
004b C3               ret
I can't read assembly all that well, but these two listings are completely identical, so analysis of this case is easy: if the compiler knows both indexes are in range, it optimizes out the range check. When I replace the fixed index with a variable index, set adds 5 operations to check for the boundary case.

As for the REASONS that set is sometimes faster: operator[] has to do a TON of work for the reference proxy that set doesn't have to do. The reason that set is sometimes slower is that the proxy is trivially inlined, in which case the only difference is that set has to do the bounds check. On the other hand, set only has to do the bounds check if the compiler cannot prove that indexes are always in range. So it depends a lot on the surrounding code. Your results may differ.
http://en.cppreference.com/w/cpp/utility/bitset/set says:
Sets the bit at position pos to the value value.
Throws std::out_of_range if pos does not correspond to a valid position within the bitset.
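That documented bounds check is directly observable. A quick sketch (the helper name is mine, not from the original post) showing set() throwing where operator[] on the same index would be undefined behavior:

```cpp
#include <bitset>
#include <cstddef>
#include <stdexcept>

// Returns true if set(pos) performed a bounds check and threw.
// For an 8-bit bitset, any pos >= 8 must raise std::out_of_range;
// b[pos] with the same out-of-range pos would be undefined behavior.
bool set_throws_out_of_range(std::size_t pos) {
    std::bitset<8> b;
    try {
        b.set(pos);    // checked access
        return false;  // no throw: pos was in range
    } catch (const std::out_of_range&) {
        return true;   // the bounds check fired
    }
}
```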
http://en.cppreference.com/w/cpp/utility/bitset/operator_at says:
Accesses the bit at position pos. Returns an object of type std::bitset::reference that allows modification of the value.
Unlike test(), does not throw exceptions: the behavior is undefined if pos is out of bounds.
and http://en.cppreference.com/w/cpp/utility/bitset/reference says:
The std::bitset class includes std::bitset::reference as a publicly-accessible nested class. This class is used as a proxy object to allow users to interact with individual bits of a bitset, since standard C++ types (like references and pointers) are not built with enough precision to specify individual bits. The primary use of std::bitset::reference is to provide an l-value that can be returned from operator[]. Any reads or writes to a bitset that happen via a std::bitset::reference potentially read or write to the entire underlying bitset.
It should be clear that operator[] actually involves a lot more than is intuitive.
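To make the proxy concrete, here's a small sketch: operator[] on a non-const bitset returns a std::bitset::reference object, not a bool, and writes through it are read-modify-write operations on the underlying storage word.

```cpp
#include <bitset>
#include <type_traits>

// Demonstrates that operator[] hands back a proxy object rather
// than a plain bool, and that the proxy supports assignment and
// operator~ as described in the cppreference quote above.
bool proxy_demo() {
    std::bitset<64> b;
    auto r = b[5];  // r is std::bitset<64>::reference, not bool
    static_assert(!std::is_same<decltype(r), bool>::value,
                  "operator[] on a non-const bitset returns a proxy");
    r = true;            // proxy write: masks and ORs a storage word
    bool flipped = ~r;   // proxy operator~ yields the inverted bit value
    return b.test(5) && flipped == false;
}
```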