Benchmarking adding elements to vector when size is known

I have made a tiny benchmark for adding new elements to vector which I know its size.

Code:

struct foo{
    foo() = default;
    foo(double x, double y, double z) :x(x), y(y), z(y){

    }
    double x;
    double y;
    double z;
};

void resize_and_index(){
    std::vector<foo> bar(1000);
    for (auto& item : bar){
        item.x = 5;
        item.y = 5;
        item.z = 5;
    }
}

void reserve_and_push(){
    std::vector<foo> bar;
    bar.reserve(1000);
    for (size_t i = 0; i < 1000; i++)
    {
        bar.push_back(foo(5, 5, 5));
    }
}

void reserve_and_push_move(){
    std::vector<foo> bar;
    bar.reserve(1000);
    for (size_t i = 0; i < 1000; i++)
    {
        bar.push_back(std::move(foo(5, 5, 5)));
    }
}

void reserve_and_embalce(){
    std::vector<foo> bar;
    bar.reserve(1000);
    for (size_t i = 0; i < 1000; i++)
    {
        bar.emplace_back(5, 5, 5);
    }
}

I have then call each method 100000 times.

results:

resize_and_index: 176 mSec 
reserve_and_push: 560 mSec
reserve_and_push_move: 574 mSec 
reserve_and_embalce: 143 mSec

Calling code:

const size_t repeate = 100000;
auto start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
    resize_and_index();
}
auto stop_time = clock();
std::cout << "resize_and_index: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;


start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
    reserve_and_push();
}
stop_time = clock();
std::cout << "reserve_and_push: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;


start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
    reserve_and_push_move();
}
stop_time = clock();
std::cout << "reserve_and_push_move: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;


start_time = clock();
for (size_t i = 0; i < repeate; i++)
{
    reserve_and_embalce();
}
stop_time = clock();
std::cout << "reserve_and_embalce: " << (stop_time - start_time) / double(CLOCKS_PER_SEC) * 1000 << " mSec" << std::endl;

My questions:

Why did I get these results? what make emplace_back superior to others?
Why does std::move make the performance slightly worse ?

Benchmarking conditions:

Compiler: VS.NET 2013 C++ compiler (/O2 Max speed Optimization)
OS : Windows 8
Processor: Intel Core i7-410U CPU @ 2.00 GHZ

Another Machine (By horstling):

VS2013, Win7, Xeon 1241 @ 3.5 Ghz

resize_and_index: 144 mSec
reserve_and_push: 199 mSec
reserve_and_push_move: 201 mSec
reserve_and_embalce: 111 mSec

Solution

First, reserve_and_push and reserve_and_push_move are semantically equivalent. The temporary foo you construct is already an rvalue (the rvalue reference overload of push_back is already used); wrapping it in a move does not change anything, except possibly obscure the code for the compiler, which could explain the slight performance loss. (Though I think it more likely to be noise.) Also, your class has identical copy and move semantics.

Second, the resize_and_index variant might be more optimal if you write the loop's body as

item = foo(5, 5, 5);

although only profiling will show that. The point is that the compiler might generate suboptimal code for the three separate assignments.

Third, you should also try this:

std::vector<foo> v(100, foo(5, 5, 5));

Fourth, this benchmark is extremely sensitive to the compiler realizing that none of these functions actually do anything and simply optimizing their complete bodies out.

Now for analysis. Note that if you really want to know what's going on, you'll have to inspect the assembly the compiler generates.

The first version does the following:

Allocate space for 1000 foos.
Loop and default-construct each one.
Loop over all elements and reassign the values.

The main question here is whether the compiler realizes that the constructor in the second step is a no-op and that it can omit the entire loop. Assembly inspection can show that.

The second and third versions do the following:

Allocate space for 1000 foos.
1000 times:
1. construct a temporary foo object
2. ensure there is still enough allocated space
3. move (for your type, equivalent to a copy, since your class doesn't have special move semantics) the temporary into the allocated space.
4. Increment the vector's size.

There is a lot of room for optimization here for the compiler. If it inlines all operations into the same function, it could realize that the size check is superfluous. It could then realize that your move constructor cannot throw, which means the entire loop is uninterruptible, which means it could merge all the increments into one assignment. If it doesn't inline the push_back, it has to place the temporary in memory and pass a reference to it; there's a number of ways it could special-case this to be more efficient, but it's unlikely that it will.

But unless the compiler does some of these, I would expect this version to be a lot slower than the others.

The fourth version does the following:

Allocate enough space for 1000 foos.
1000 times:
1. ensure there is still enough allocated space
2. create a new object in the allocated space, using the constructor with the three arguments
3. increment the size

This is similar to the previous, with two differences: first, the way the MS standard library implements push_back, it has to check whether the passed reference is a reference into the vector itself; this greatly increases complexity of the function, inhibiting inlining. emplace_back does not have this problem. Second, emplace_back gets three simple scalar arguments instead of a reference to a stack object; if the function doesn't get inlined, this is significantly more efficient to pass.

Unless you work exclusively with Microsoft's compiler, I would strongly recommend you compare with other compilers (and their standard libraries). I also think that my suggested version would beat all four of yours, but I haven't profiled this.

In the end, unless the code is really performance-sensitive, you should write the version that is most readable. (That's another place where my version wins, IMO.)