Tags: c++, multithreading, c++11, stdasync

Switching from std::thread to std::async yields a HUGE performance gain. How is this possible?


I've written some test code to compare std::thread and std::async.

#include <iostream>
#include <mutex>
#include <fstream>
#include <string>
#include <memory>
#include <thread>
#include <future>
#include <functional>
#include <vector>
#include <cassert>
#include <boost/noncopyable.hpp>
#include <boost/lexical_cast.hpp>
#include <boost/filesystem.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <boost/asio.hpp>

namespace fs = boost::filesystem;
namespace pt = boost::posix_time;
namespace as = boost::asio;
class Log : private boost::noncopyable
{
public:
    void LogPath(const fs::path& filePath) {
        boost::system::error_code ec;
        if(fs::exists(filePath, ec)) {
            fs::remove(filePath);
        }
        this->ofStreamPtr_.reset(new fs::ofstream(filePath));
    };

    void WriteLog(std::size_t i) {
        assert(*this->ofStreamPtr_);
        std::lock_guard<std::mutex> lock(this->logMutex_);
        *this->ofStreamPtr_ << "Hello, World! " << i << "\n";
    };

private:
    std::mutex logMutex_;
    std::unique_ptr<fs::ofstream> ofStreamPtr_;
};

int main(int argc, char *argv[]) {
    if(argc != 2) {
        std::cout << "Wrong argument" << std::endl;
        exit(1);
    }
    std::size_t iter_count = boost::lexical_cast<std::size_t>(argv[1]);

    Log log;
    log.LogPath("log.txt");

    std::function<void(std::size_t)> func = std::bind(&Log::WriteLog, &log, std::placeholders::_1);

    auto start_time = pt::microsec_clock::local_time();
    ////// Version 1: use std::thread //////
//    {
//        std::vector<std::shared_ptr<std::thread> > threadList;
//        threadList.reserve(iter_count);
//        for(std::size_t i = 0; i < iter_count; i++) {
//            threadList.push_back(
//                std::make_shared<std::thread>(func, i));
//        }
//
//        for(auto it: threadList) {
//            it->join();
//        }
//    }

//    pt::time_duration duration = pt::microsec_clock::local_time() - start_time;
//    std::cout << "Version 1: " << duration << std::endl;

    ////// Version 2: use std::async //////
    start_time = pt::microsec_clock::local_time();
    {
        for(std::size_t i = 0; i < iter_count; i++) {
            auto result = std::async(func, i);
        }
    }

    pt::time_duration duration = pt::microsec_clock::local_time() - start_time;
    std::cout << "Version 2: " << duration << std::endl;

    ////// Version 3: use boost::asio::io_service //////
//    start_time = pt::microsec_clock::local_time();
//    {
//        as::io_service ioService;
//        as::io_service::strand strand{ioService};
//        {
//            for(std::size_t i = 0; i < iter_count; i++) {
//                strand.post(std::bind(func, i));
//            }
//        }
//        ioService.run();
//    }

//    duration = pt::microsec_clock::local_time() - start_time;
//    std::cout << "Version 3: " << duration << std::endl;


}

On a 4-core CentOS 7 box (gcc 4.8.5), Version 1 (using std::thread) is about 100x slower than the other implementations.

Iterations  Version 1   Version 2   Version 3
100         0.0034s     0.000051s   0.000066s
1000        0.038s      0.00029s    0.00058s
10000       0.41s       0.0042s     0.0059s
100000      (throws)    0.026s      0.061s

Why is the threaded version so slow? I thought each thread wouldn't take long to complete the Log::WriteLog call.


Solution

  • The function may never be called. You are not passing an std::launch policy in Version 2, so you are relying on the default behavior of std::async (emphasis mine):

    Behaves the same as async(std::launch::async | std::launch::deferred, f, args...). In other words, f may be executed in another thread or it may be run synchronously when the resulting std::future is queried for a value.
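
    To see the deferred half of that policy in action, here is a small standalone sketch (not part of the original benchmark) that forces std::launch::deferred so the lazy behaviour is observable:

    #include <chrono>
    #include <future>
    #include <iostream>
    #include <thread>

    int main() {
        // Force the deferred policy so the callable does not run at the call site.
        auto fut = std::async(std::launch::deferred, [] {
            std::cout << "running on thread " << std::this_thread::get_id() << "\n";
            return 42;
        });

        // Nothing has executed yet; wait_for reports the task as deferred.
        if(fut.wait_for(std::chrono::seconds(0)) == std::future_status::deferred) {
            std::cout << "task is deferred, not started\n";
        }

        // get() runs the callable synchronously on this thread.
        std::cout << "result: " << fut.get() << "\n";
    }

    With the default policy, an implementation is free to pick this deferred behaviour, in which case a discarded future (as in Version 2) may mean the work never runs at all.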

    Try re-running your benchmark with this minor change:

    auto result = std::async(std::launch::async, func, i);
    

    Alternatively, you could call result.wait() on each std::future in a second loop, similar to how you call join() on all of the threads in Version 1. This forces evaluation of the std::future.
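
    A minimal sketch of that variant, mirroring the join() loop of Version 1 (it reuses the question's func and iter_count; timing boilerplate omitted):

    {
        std::vector<std::future<void>> futureList;
        futureList.reserve(iter_count);
        for(std::size_t i = 0; i < iter_count; i++) {
            // Default policy again, but the futures are kept this time.
            futureList.push_back(std::async(func, i));
        }

        // Forces evaluation: each wait() either blocks until the worker
        // thread finishes or runs a deferred task right here, analogous
        // to calling join() on each thread.
        for(auto& f : futureList) {
            f.wait();
        }
    }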

    Note that there is a major, unrelated problem with this benchmark: func immediately acquires a lock for the full duration of the call, which makes parallelism impossible. There is no advantage to using threads here - I suspect the threaded versions will be significantly slower (due to thread creation and locking overhead) than a serial implementation; a serial baseline is sketched below for comparison.
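
    For reference, this is a hypothetical "Version 4" one could drop into the benchmark to test that suspicion, reusing the question's func, start_time, and duration variables:

    ////// Hypothetical Version 4: plain serial loop //////
    start_time = pt::microsec_clock::local_time();
    {
        for(std::size_t i = 0; i < iter_count; i++) {
            func(i);   // no threads, no std::async; the mutex is uncontended
        }
    }

    duration = pt::microsec_clock::local_time() - start_time;
    std::cout << "Version 4: " << duration << std::endl;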