c++multithreading concurrency parallel-processing

How many threads should one use in a general class or function, used by different users on different architectures?

EDIT: Assume that the number of elements in the vectors, although equal, is arbitrary. This means that the vector might contain only a few elements (e.g. 10!), in which cases using threads would make no sense.

Suppose we want to implement a concurrent version of operator+= for a general purpose class that different people are going to use, on different architectures. The operator+= simply sums all the elements of two std::vectors. Let's imagine we have a lots of elements in said vectors, enough to use parallelism, and that both std::vectors have equal size.

We also assume that we use threads, similarly as described in the first chapter of "C++ Concurrency in Action". Something like this:

 Foo& asynchronous_sum(const Foo& rhs) {
        unsigned long const length=vec.size();
        unsigned long const min_per_thread=25;
        unsigned long const max_threads=(length+min_per_thread-1)/min_per_thread; 
        unsigned long const hardware_threads= std::thread::hardware_concurrency();
        unsigned long const num_threads=std::min(hardware_threads!=0?hardware_threads:2,max_threads);
        unsigned long const block_size=length/num_threads; 
        std::vector<std::thread> threads(num_threads-1); 
        int block_start = 0;
        for(unsigned long i=0;i<(num_threads-1);++i) {
            int block_end=block_start;
            block_end += block_size;
            threads[i]=std::thread([&](int start, int end) {
                 for (int i = start; i < end; ++i) {
                    vec[i] += rhs.vec[i];
                 }}, block_start,block_end);
            block_start=block_end; 
        }
        for (int i = block_start; i < vec.size(); ++i) {
            vec[i] += rhs.vec[i];
        }
        for (auto& current_thread: threads) {
            current_thread.join();
        }
 
        return *this;
    }

Note that 25 is an arbitrary number chosen by the book's author.

Let's now assume that we want to create N threads, where N is the optimal number of threads on a given architeture. So user A might have N=5, and user B might get N=3, depending on their architecture.

Is there any way to achieve this for a general purpose class? ...this could also be read as, how do "parallel algorithms" in C++ achieve this?

Solution

The parameters that must be tuned for optimal performance on parallel hardware are numerous: beyond a simple count of execution agents, CPU affinities, minimum block sizes, GPU thread layouts, and prefetch strategies all depend on the application and the machine. Standardizing a unified representation of these disparate variables would be inordinately complicated and quickly outdated.

As such, rather than attempting to provide detailed access to the platform, the standard library parallel algorithms instead target programs that do not handle parallelism in other ways themselves. Accordingly, the library may assume that it is the only user of parallelism on the hardware. Common implementations then use a process-wide thread pool to execute parallel algorithms, with heuristics for block sizes. This supports nested parallel algorithm calls in a reasonably natural way, since the thread counts do not explode.

It might therefore be worth using the standard parallel algorithms in your mathematical functions, in hopes that a client could benefit from them alone or in concert with their own usage thereof. It remains a risk, however, that the client needs more sophisticated parallelism and your efforts merely complicate matters. Doing more than using <execution> practically guarantees such an outcome.