Search code examples
c++heap-memorynew-operatorstack-memoryautomatic-storage

Why is allocation on the heap faster than allocation on the stack?


As far as my knowledge on resource management goes, allocating something on the heap (operator new) should always be slower than allocating on the stack (automatic storage), because the stack is a LIFO-based structure, thus it requires minimal bookkeeping, and the pointer of the next address to allocate is trivial.

So far, so good. Now look at the following code:

/* ...includes... */

using std::cout;
using std::cin;
using std::endl;

int bar() { return 42; }

int main()
{
    auto s1 = std::chrono::steady_clock::now();
    std::packaged_task<int()> pt1(bar);
    auto e1 = std::chrono::steady_clock::now();

    auto s2 = std::chrono::steady_clock::now();
    auto sh_ptr1 = std::make_shared<std::packaged_task<int()> >(bar);
    auto e2 = std::chrono::steady_clock::now();

    auto first = std::chrono::duration_cast<std::chrono::nanoseconds>(e1-s1);
    auto second = std::chrono::duration_cast<std::chrono::nanoseconds>(e2-s2);

    cout << "Regular: " << first.count() << endl
         << "Make shared: " << second.count() << endl;

    pt1();
    (*sh_ptr1)();

    cout << "As you can see, both are working correctly: " 
         << pt1.get_future().get() << " & " 
         << sh_ptr1->get_future().get() << endl;

    return 0;
}

The results seem to contradict the stuff explained above:

Regular: 6131

Make shared: 843

As you can see, both are working correctly: 42 & 42

Program ended with exit code: 0

In the second measurement, apart from the call of operator new, the constructor of the std::shared_ptr (auto sh_ptr1) has to finish. I can't seem to understand why is this faster then regular allocation.

What is the explanation for this?


Solution

  • The problem is that the first call to the constructor of std::packaged_task is responsible for initializing a load of per-thread state that is then unfairly attributed to pt1. This is a common problem of benchmarking (particularly microbenchmarking) and is alleviated by warmup; try reading How do I write a correct micro-benchmark in Java?

    If I copy your code but run both parts first, the results are the same to within the limits of the resolution of the system clock. This demonstrates another issue of microbenchmarking, that you should run small tests multiple times to allow total time to be measured accurately.

    With warmup and running each part 1000 times, I get the following (example):

    Regular: 132.986
    Make shared: 211.889
    

    The difference (approx 80ns) accords well with the rule of thumb that malloc takes 100ns per call.