I have made a tiny benchmark for adding new elements to vector which I know its size.
Code:
struct foo{
foo() = default;
foo(double x, double y, double z) :x(x), y(y), z(y){
}
double x;
double y;
double z;
};
void resize_and_index(){
std::vector<foo> bar(1000);
for (auto& item : bar){
item.x = 5;
item.y = 5;
item.z = 5;
}
}
void reserve_and_push(){
std::vector<foo> bar;
bar.reserve(1000);
for (size_t i = 0; i < 1000; i++)
{
bar.push_back(foo(5, 5, 5));
}
}
void reserve_and_push_move(){
std::vector<foo> bar;
bar.reserve(1000);
for (size_t i = 0; i < 1000; i++)
{
bar.push_back(std::move(foo(5, 5, 5)));
}
}
void reserve_and_embalce(){
std::vector<foo> bar;
bar.reserve(1000);
for (size_t i = 0; i < 1000; i++)
{
bar.emplace_back(5, 5, 5);
}
}
I have then call each method 100000 times.
results:
resize_and_index: 176 mSec
reserve_and_push: 560 mSec
reserve_and_push_move: 574 mSec
reserve_and_embalce: 143 mSec
Calling code:
const size_t repeate = 100000;
auto start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
resize_and_index();
}
auto stop_time = clock();
std::cout << "resize_and_index: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;
start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
reserve_and_push();
}
stop_time = clock();
std::cout << "reserve_and_push: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;
start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
reserve_and_push_move();
}
stop_time = clock();
std::cout << "reserve_and_push_move: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;
start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
reserve_and_embalce();
}
stop_time = clock();
std::cout << "reserve_and_embalce: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;
My questions:
Benchmarking conditions:
Another Machine (By horstling):
VS2013, Win7, Xeon 1241 @ 3.5 Ghz
resize_and_index: 144 mSec
reserve_and_push: 199 mSec
reserve_and_push_move: 201 mSec
reserve_and_embalce: 111 mSec
First, reserve_and_push and reserve_and_push_move are semantically equivalent. The temporary foo you construct is already an rvalue (the rvalue reference overload of push_back is already used); wrapping it in a move does not change anything, except possibly obscure the code for the compiler, which could explain the slight performance loss. (Though I think it more likely to be noise.) Also, your class has identical copy and move semantics.
Second, the resize_and_index variant might be more optimal if you write the loop's body as
item = foo(5, 5, 5);
although only profiling will show that. The point is that the compiler might generate suboptimal code for the three separate assignments.
Third, you should also try this:
std::vector<foo> v(100, foo(5, 5, 5));
Fourth, this benchmark is extremely sensitive to the compiler realizing that none of these functions actually do anything and simply optimizing their complete bodies out.
Now for analysis. Note that if you really want to know what's going on, you'll have to inspect the assembly the compiler generates.
The first version does the following:
The main question here is whether the compiler realizes that the constructor in the second step is a no-op and that it can omit the entire loop. Assembly inspection can show that.
The second and third versions do the following:
There is a lot of room for optimization here for the compiler. If it inlines all operations into the same function, it could realize that the size check is superfluous. It could then realize that your move constructor cannot throw, which means the entire loop is uninterruptible, which means it could merge all the increments into one assignment. If it doesn't inline the push_back, it has to place the temporary in memory and pass a reference to it; there's a number of ways it could special-case this to be more efficient, but it's unlikely that it will.
But unless the compiler does some of these, I would expect this version to be a lot slower than the others.
The fourth version does the following:
This is similar to the previous, with two differences: first, the way the MS standard library implements push_back, it has to check whether the passed reference is a reference into the vector itself; this greatly increases complexity of the function, inhibiting inlining. emplace_back does not have this problem. Second, emplace_back gets three simple scalar arguments instead of a reference to a stack object; if the function doesn't get inlined, this is significantly more efficient to pass.
Unless you work exclusively with Microsoft's compiler, I would strongly recommend you compare with other compilers (and their standard libraries). I also think that my suggested version would beat all four of yours, but I haven't profiled this.
In the end, unless the code is really performance-sensitive, you should write the version that is most readable. (That's another place where my version wins, IMO.)