c++ multithreading c++11 asynchronous stdasync

Why is std::async slow compared to simple detached threads?

I've been told several times that I should use std::async with the std::launch::async parameter for fire-and-forget tasks (so that it does its magic on a new thread of execution, preferably).

Encouraged by these statements, I wanted to see how std::async compares to:

sequential execution
a simple detached std::thread
my simple async "implementation"

My naive async implementation looks like this:

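#include <functional> // std::bind
#include <future>     // std::future, std::packaged_task
#include <thread>     // std::thread
#include <utility>    // std::forward, std::move

template <typename F, typename... Args>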
auto myAsync(F&& f, Args&&... args) -> std::future<decltype(f(args...))>
{
    std::packaged_task<decltype(f(args...))()> task(std::bind(std::forward<F>(f), std::forward<Args>(args)...));
    auto future = task.get_future();

    std::thread thread(std::move(task));
    thread.detach();

    return future;
}

Nothing fancy here: it packs the functor f, along with its arguments, into a std::packaged_task, launches the task on a new std::thread, which is then detached, and returns the std::future obtained from the task.

And now the code measuring execution time with std::chrono::high_resolution_clock:

int main(void)
{
    constexpr unsigned short TIMES = 1000;

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < TIMES; ++i)
    {
        someTask();
    }
    auto dur = std::chrono::high_resolution_clock::now() - start;

    auto tstart = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < TIMES; ++i)
    {
        std::thread t(someTask);
        t.detach();
    }
    auto tdur = std::chrono::high_resolution_clock::now() - tstart;

    std::future<void> f;
    auto astart = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < TIMES; ++i)
    {
        f = std::async(std::launch::async, someTask);
    }
    auto adur = std::chrono::high_resolution_clock::now() - astart;

    auto mastart = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < TIMES; ++i)
    {
        f = myAsync(someTask);
    }
    auto madur = std::chrono::high_resolution_clock::now() - mastart;

    std::cout << "Sequential: " << std::chrono::duration_cast<std::chrono::microseconds>(dur).count() <<
    std::endl << "Threaded: " << std::chrono::duration_cast<std::chrono::microseconds>(tdur).count() <<
    std::endl << "std::async: " << std::chrono::duration_cast<std::chrono::microseconds>(adur).count() <<
    std::endl << "My async: " << std::chrono::duration_cast<std::chrono::microseconds>(madur).count() << std::endl;

    return EXIT_SUCCESS;
}

Here someTask() is a simple function that sleeps briefly to simulate some work being done:

void someTask()
{
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

Finally, my results (in microseconds):

Sequential: 1263615
Threaded: 47111
std::async: 821441
My async: 30784

Could anyone explain these results? It seems that std::async is much slower than my naive implementation, or even plain detached std::threads. Why is that? Given these results, is there any reason to use std::async?

(Note that I ran this benchmark with both clang++ and g++, and the results were very similar.)

UPDATE:

After reading Dave S's answer, I updated my little benchmark as follows:

std::future<void> f[TIMES];
auto astart = std::chrono::high_resolution_clock::now();
for (int i = 0; i < TIMES; ++i)
{
    f[i] = std::async(std::launch::async, someTask);
}
auto adur = std::chrono::high_resolution_clock::now() - astart;

So the std::futures are no longer destroyed, and thus joined, on every iteration. After this change, std::async produces results similar to my implementation and to detached std::threads.

Solution

One key difference is that the future returned by std::async joins the thread when the future is destroyed or, as in your case, replaced with a new value.

This means that each iteration has to wait for a someTask() call to finish and then join its thread, both of which take time. None of your other tests do that; they simply spawn the threads and move on.