std::async performance on Windows and Solaris 10

I'm running a simple threaded test program on both a Windows machine (compiled using MSVS2015) and a server running Solaris 10 (compiled using GCC 4.9.3). On Windows I get a significant speedup when I increase the number of threads from 1 to the number of cores available; however, the very same code sees no performance gain at all on Solaris 10.

The Windows machine has 4 cores (8 logical) and the Unix machine has 8 cores (16 logical).

What could be the cause for this? I'm compiling with -pthread, and it is creating threads since it prints all the "S"es before the first "F". I don't have root access on the Solaris machine, and from what I can see there's no installed tool which I can use to view a process' affinity.

Example code:

#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>

std::default_random_engine gen(std::chrono::system_clock::now().time_since_epoch().count());
std::normal_distribution<double> randn(0.0, 1.0);

double generate_randn(uint64_t iterations)
{
    // Print "S" when a thread starts
    std::cout << "S";
    std::cout.flush();

    double rvalue = 0;
    for (int i = 0; i < iterations; i++)
    {
        rvalue += randn(gen);
    }
    // Print "F" when a thread finishes
    std::cout << "F";
    std::cout.flush();

    return rvalue/iterations;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 0;

    uint64_t count = 100000000;
    uint32_t threads = std::atoi(argv[1]);

    double total = 0;

    std::vector<std::future<double>> futures;
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;

    // Start timing
    t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < threads; i++)
    {
        // Start async tasks
        futures.push_back(std::async(std::launch::async, generate_randn, count/threads));
    }
    for (auto &future : futures)
    {
        // Wait for tasks to finish
        future.wait();
        total += future.get();
    }
    // End timing
    t2 = std::chrono::high_resolution_clock::now();

    // Take the average of the threads' results
    total /= threads;

    std::cout << std::endl;
    std::cout << total << std::endl;
    std::cout << "Finished in " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms" << std::endl;
}

Solution

As a general rule, classes defined by the C++ standard library do not have any internal locking. Modifying an instance of a standard library class from more than one thread, or reading it from one thread while writing it from another, is undefined behavior, unless "objects of that type are explicitly specified as being sharable without data races". (N3337, sections 17.6.4.10 and 17.6.5.9.) The RNG classes are not "explicitly specified as being sharable without data races". (cout is an example of a stdlib object that is "sharable without data races", as long as you haven't done ios::sync_with_stdio(false).)

As such, your program is incorrect because it accesses a global RNG object from more than one thread simultaneously; every time you request another random number, the internal state of the generator is modified. On Solaris, this seems to result in serialization of accesses, whereas on Windows it is probably instead causing you not to get properly "random" numbers.
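For contrast, here is a minimal sketch of the smallest change that would make a shared generator well-defined: take a lock around every draw. The mutex is my own addition, not something in your program, and the names gen_mutex, locked_randn and locked_mean and the fixed seed are made up for illustration. It removes the undefined behavior, but every thread now queues for the same lock, so you should not expect it to scale with the number of threads.

```cpp
#include <future>
#include <mutex>
#include <random>

// Shared state, as in the question, plus a mutex to guard it (hypothetical names)
static std::mutex gen_mutex;
static std::default_random_engine gen(12345);   // fixed seed, illustration only
static std::normal_distribution<double> randn(0.0, 1.0);

// Draw one value while holding the lock, so only one thread
// touches the engine and the distribution at a time
static double locked_randn()
{
    std::lock_guard<std::mutex> lock(gen_mutex);
    return randn(gen);
}

// Average of `iterations` draws; every draw contends for gen_mutex
static double locked_mean(unsigned long iterations)
{
    double sum = 0;
    for (unsigned long i = 0; i < iterations; ++i)
        sum += locked_randn();
    return sum / iterations;
}
```

Launching locked_mean from several std::async tasks is well-defined, but the tasks spend most of their time waiting on one another.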

The cure is to create separate RNGs for each thread. Then each thread will operate independently, and they will neither slow each other down nor step on each other's toes. This is a special case of a very general principle: multithreading always works better the less shared data there is.
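A small sketch of what "nothing shared" buys you, assuming a hypothetical helper named mean_of_normals that owns its engine and distribution: two tasks started with the same seed compute exactly the same value, no matter how their execution interleaves, because neither can disturb the other's generator state. The seed 42 and the iteration count are arbitrary.

```cpp
#include <cstdint>
#include <future>
#include <random>

// Hypothetical helper: each call builds its own engine and distribution,
// so concurrent calls share no mutable state
static double mean_of_normals(std::uint64_t iterations, unsigned int seed)
{
    std::default_random_engine gen(seed);
    std::normal_distribution<double> randn(0.0, 1.0);

    double sum = 0;
    for (std::uint64_t i = 0; i < iterations; ++i)
        sum += randn(gen);
    return sum / iterations;
}

// Run two tasks concurrently with the same seed and compare their results
static bool same_seed_same_result()
{
    auto a = std::async(std::launch::async, mean_of_normals,
                        std::uint64_t(100000), 42u);
    auto b = std::async(std::launch::async, mean_of_normals,
                        std::uint64_t(100000), 42u);
    return a.get() == b.get();
}
```

With the global generator from the question there is no such guarantee, since each task would consume values from the same stream.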

There's an additional wrinkle to worry about: if each thread seeded its own generator from system_clock::now, the calls would happen at very nearly the same time, so some of the per-thread RNGs could end up seeded with the same value. It would be better to seed them all from a random_device object. random_device requests random numbers from the operating system and does not need to be seeded, but it can be very slow. The random_device should be created and used inside main, and the seeds passed to each worker function, because a global random_device accessed from multiple threads (as in the previous edition of this answer) is just as undefined as a global default_random_engine.
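If you are worried about the cost of random_device, one possible variation (a sketch, not something your program needs) is to draw only a few values from it and let std::seed_seq stretch them into one seed per task. The helper name make_task_seeds and the choice of four draws are assumptions of mine. seed_seq does not promise that the generated values are distinct, though collisions should be rare.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper, meant to be called from main() before any task starts,
// so the random_device is only ever used by one thread
static std::vector<std::uint32_t> make_task_seeds(std::size_t tasks)
{
    std::random_device rd;
    std::seed_seq seq{rd(), rd(), rd(), rd()};   // a few OS-provided values

    std::vector<std::uint32_t> seeds(tasks);
    seq.generate(seeds.begin(), seeds.end());    // one 32-bit seed per task
    return seeds;
}
```

Each element of the returned vector can then be passed to one worker as its seed argument.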

All told, your program should look something like this:

#include <iostream>
#include <vector>
#include <future>
#include <random>
#include <chrono>
#include <cstdint>   // uint64_t, uint32_t
#include <cstdlib>   // std::atoi

static double generate_randn(uint64_t iterations, unsigned int seed)
{
    // Print "S" when a thread starts
    std::cout << "S";
    std::cout.flush();

    std::default_random_engine gen(seed);
    std::normal_distribution<double> randn(0.0, 1.0);

    double rvalue = 0;
    for (uint64_t i = 0; i < iterations; i++)
    {
        rvalue += randn(gen);
    }
    // Print "F" when a thread finishes
    std::cout << "F";
    std::cout.flush();

    return rvalue/iterations;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 0;

    uint64_t count = 100000000;
    uint32_t threads = std::atoi(argv[1]);
    // Guard against a zero thread count, which would divide by zero below
    if (threads == 0)
        return 1;

    double total = 0;

    std::vector<std::future<double>> futures;
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;

    std::random_device make_seed;

    // Start timing
    t1 = std::chrono::high_resolution_clock::now();
    for (uint32_t i = 0; i < threads; i++)
    {
        // Start async tasks
        futures.push_back(std::async(std::launch::async,
                                     generate_randn,
                                     count/threads,
                                     make_seed()));
    }
    for (auto &future : futures)
    {
        // get() blocks until the task has finished, then returns its result
        total += future.get();
    }
    // End timing
    t2 = std::chrono::high_resolution_clock::now();

    // Take the average of the threads' results
    total /= threads;

    std::cout << '\n' << total
              << "\nFinished in "
              << std::chrono::duration_cast<
                   std::chrono::milliseconds>(t2 - t1).count()
              << " ms\n";
}