Tags: c++, multithreading, memory, vector, parallel-processing

Multithreading increases time in C++


I'm trying to understand why increasing the number of threads (after a certain number) increases the CPU time instead of decreasing it.

A summary of what the code does: in main I create a large vector based on a dimension. I fill it with the indexes 0..dimension-1 and then shuffle it. Then, in order to divide and conquer, I partition this vector, giving a slice to each thread. I prepare a vector of solution vectors, giving one entry to each thread. Each thread calls a function on each element of its slice, passing a reference to its prepared solution. This function just changes the value of the solution at the given index, and then increments all the elements in the solution (I added this to increase the computation time a bit). Here is the code:

#include <algorithm>
#include <random>
#include <thread>
#include <iostream>
#include <chrono>
#include <vector>
#include <numeric>
#include <cstdio>   // printf
#include <cstdlib>  // srand
#include <ctime>    // time

using namespace std;
template<typename T>
inline double getMs(T start, T end) {
    return double(
        std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
        .count()) /
        1000;
}

void fun(const std::vector<int>& slice, int dimension, 
    vector<int>& internal_sol) {
    int size = dimension;

    for (auto const& s : slice) {
        internal_sol[s] = 45;
        internal_sol[int(s/2)] = 45;
        if (s != 0) {
            internal_sol[s - 1] = 45;
        }
        if (s != internal_sol.size() - 1) {
            internal_sol[s + 1] = 45;
        }
        for (auto& i_s : internal_sol)
            i_s += 1;
    }

}


int main() {

    
    std::random_device rd;
    std::mt19937 g(rd());

    unsigned int n = std::thread::hardware_concurrency();
    std::cout << n << " concurrent threads are supported.\n";

    std::srand(unsigned(std::time(nullptr)));

    int number_stops = 50000; 
    int number_transfers = 3; 
    int number_structures = 2; 

    auto dimension = number_stops * (number_transfers + 1) * number_structures;
    std::vector<int> v(dimension);
    std::iota(std::begin(v), std::end(v), 0);
    std::shuffle(std::begin(v), std::end(v), g);

    printf("stops:\t%d\ntransfers:\t%d\nstructures:\t%d\n\n", number_stops, number_transfers, number_structures);

    for (size_t np = 2; np < 17; np++) {

        size_t sz = dimension;
        size_t part = sz / np;
        auto start = std::chrono::steady_clock::now();
        std::cout << np << " threads: ";
        std::vector<std::thread> threads(np);


        auto start_ext_sol = std::chrono::steady_clock::now();
        vector<vector<int>> external_sol; //As TAUs from velociraptor
        for (size_t i = 0; i < np; i++) {
            external_sol.emplace_back(dimension, -1);
        }
        double elapsed_ext_sol = getMs(start_ext_sol, std::chrono::steady_clock::now());


        auto paraTask = [&](size_t start, size_t end, vector<int>& ext_sol) {
            for (size_t l = start; l < end; ++l)
                fun({v[l]}, dimension, ext_sol); // builds a one-element temporary vector per call
            for (size_t i = 0; i < ext_sol.size(); i++)
                ext_sol[i] = -1;
        };

        for (size_t i = 0; i < np; i++) {
            size_t start = i * part;
            size_t length = (i + 1 == np) ? sz - i * part : part;
            threads[i] =
                std::thread(paraTask, start, start + length, std::ref(external_sol[i]));
        }
        
        // Join the threads
        for (auto&& thread : threads) thread.join();

        double elapsed = getMs(start, std::chrono::steady_clock::now());

        printf("parallel completed: %.3f sec. preparing all vectors ext_sols: %.3f sec\n",
            elapsed, elapsed_ext_sol);
    }

    return 0;
}

The result obtained is:

16 concurrent threads are supported.
stops:  50000
transfers:      3
structures:     2

2 threads: parallel completed: 6.776 sec. preparing all vectors ext_sols: 0.000 sec
3 threads: parallel completed: 5.711 sec. preparing all vectors ext_sols: 0.004 sec
4 threads: parallel completed: 5.285 sec. preparing all vectors ext_sols: 0.003 sec
5 threads: parallel completed: 4.892 sec. preparing all vectors ext_sols: 0.001 sec
6 threads: parallel completed: 4.614 sec. preparing all vectors ext_sols: 0.008 sec
7 threads: parallel completed: 4.726 sec. preparing all vectors ext_sols: 0.001 sec
8 threads: parallel completed: 4.281 sec. preparing all vectors ext_sols: 0.004 sec
9 threads: parallel completed: 4.815 sec. preparing all vectors ext_sols: 0.007 sec
10 threads: parallel completed: 4.791 sec. preparing all vectors ext_sols: 0.008 sec
11 threads: parallel completed: 5.594 sec. preparing all vectors ext_sols: 0.002 sec
12 threads: parallel completed: 5.411 sec. preparing all vectors ext_sols: 0.012 sec
13 threads: parallel completed: 5.213 sec. preparing all vectors ext_sols: 0.010 sec
14 threads: parallel completed: 6.159 sec. preparing all vectors ext_sols: 0.005 sec
15 threads: parallel completed: 8.353 sec. preparing all vectors ext_sols: 0.006 sec
16 threads: parallel completed: 6.769 sec. preparing all vectors ext_sols: 0.012 sec

As you can see, the time increases. I don't see why this happens. The dimension here is 400,000 integers for each thread, therefore around 1.6 MB (counting 4 bytes for each integer). Thus, we don't go over the size of the L1 cache, in my understanding. What is happening?

Here are some details about my machine:

  • CPU - 11th Gen Intel(R) Core(TM) i7-11700KF @ 3.60GHz
  • RAM - 16 GB DDR4
  • OS - Windows 11
  • Compiler - MS Visual Studio 2022

And here is the memory hardware report from CPU-Z:

[screenshot: CPU-Z memory report]

PS: I also tried removing the loop

for (auto& i_s : internal_sol)
    i_s += 1;

from the function, but the same thing happens (with lower timings, of course).


Solution

  • Here is a more practical and more precise explanation of why the timings increase with more threads:

    • cache misses are the main scalability bottleneck, especially due to the shared L3 cache and the DRAM (also shared between cores);
    • hyper-threading does not help much, so using more than 8 threads is useless;
    • creating threads is definitely not an issue (negligible time).

    Your i7-11700KF processor has 8 cores with 2 hardware threads per core. Thus, up to 8 threads can each run on their own core without much scalability issue coming from the hardware itself. Using more than 8 threads causes 2 threads to run on the same core. This is why the timings decrease up to 8 threads and then start to increase again with >= 9 threads: one core has two threads to run while the others likely have only one. Note that threads can move from one core to another (unless you bind them manually), hence the unstable timings beyond 8 threads.
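
    (If you want to rule out thread migration while benchmarking, you can pin each thread to a logical core. Below is a minimal Windows/MSVC sketch, since that is the platform used here, based on the Win32 call SetThreadAffinityMask applied to std::thread::native_handle(); the mapping from logical-core numbers to physical cores is machine-specific, so the 2*i numbering mentioned in the comment is an assumption to verify on your machine.)

        #include <thread>
        #include <windows.h>

        // Pin a std::thread to one logical processor (Windows-only sketch).
        // On many HT machines, logical processors 2*k and 2*k+1 belong to
        // the same physical core, so pinning thread i to logical core 2*i
        // is one way to spread threads across physical cores.
        void pin_to_logical_core(std::thread& t, unsigned core) {
            DWORD_PTR mask = DWORD_PTR(1) << core;
            SetThreadAffinityMask(t.native_handle(), mask);
        }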

    Hyper-threading (the Intel technology enabling cores to execute more than 1 thread per core) is mainly meant to increase processor efficiency by running 2 threads in parallel on the same core. But there is a catch: the two hardware threads mostly share the core's hardware resources. For example, the L1 and L2 caches are shared, so having more threads on the same core also means less cache space per thread. This is also the case for the instruction decoder, which needs to decode the instructions of two threads instead of just one (twice as many instructions). This is why hyper-threading does not systematically make code run faster. In fact, many codes are (a bit) slower because of cache overheads (e.g. more cache misses due to cache thrashing). Codes that run faster with hyper-threading are typically the ones that are latency-bound, especially the ones bound by the memory hierarchy (e.g. a naive matrix transposition). That being said, this is not always true. For example, the line-fill-buffer unit of the processor, responsible for registering memory accesses to the L2 due to L1 misses, is shared between the 2 hardware threads, so having two threads does not help if one can already saturate it. Such a thing happens in your case. Also note that codes running a long chain of dependent high-latency instructions can benefit a bit from hyper-threading (e.g. iterative numerical recipes using divisions or FMAs), but this is not your case. In the end, this is why hyper-threading does not help here.
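
    (A practical consequence: for a memory-bound loop like this one, capping the thread count at the number of physical cores is usually a good default. std::thread::hardware_concurrency() reports logical processors, so a rough sketch is shown below; the division by 2 assumes 2-way SMT, which holds for the i7-11700KF but is an assumption in general.)

        #include <algorithm>
        #include <thread>

        // Rough cap on useful threads for a memory-bound loop.
        // Assumes 2 hardware threads per core (true for the i7-11700KF);
        // query the OS for the real topology in portable code.
        unsigned useful_thread_count() {
            unsigned logical = std::thread::hardware_concurrency(); // may return 0
            return std::max(1u, logical / 2);
        }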

    Then the question is why the algorithm does not scale even on 8 cores. @463035818_is_not_a_number mostly explained why there is a theoretical limit to the scalability: any significant fraction of instructions executed sequentially is expected to become the bottleneck. However, the code looks pretty embarrassingly parallel overall, so that does not really help here. The time to create and join threads (as suggested in the comments) is not an issue either, because the code runs for several seconds while the time to create threads is no more than a few milliseconds on most PCs (including mine); a quick way to check this is sketched below.
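
    (A self-contained micro-benchmark for the thread-creation cost; the exact numbers will vary from machine to machine.)

        #include <chrono>
        #include <cstdio>
        #include <thread>
        #include <vector>

        int main() {
            auto t0 = std::chrono::steady_clock::now();
            std::vector<std::thread> ts;
            for (int i = 0; i < 16; i++)
                ts.emplace_back([] {}); // 16 no-op threads
            for (auto& t : ts) t.join();
            auto t1 = std::chrono::steady_clock::now();
            long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
            printf("create+join of 16 no-op threads: %lld us\n", us);
            return 0;
        }

    On a typical PC this prints somewhere between a fraction of a millisecond and a few milliseconds, i.e. negligible next to the multi-second timings above.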

    The thing is that your algorithm is bound by the memory hierarchy and, more specifically, the scalability of the code is bound by the L3 cache, which is shared by all cores. Indeed, there are np vectors called external_sol[i] and each one takes dimension*sizeof(int), which is 50_000*(3+1)*2*4/1024**2 = 1.52 MiB in memory. Thus, with 8 threads, 12.16 MiB is used. Each vector does not fit in the L1 nor the L2 cache of your processor (respectively 80 KiB and 512 KiB per core).

    Each thread indexes its vector with a random integer between 0 and the size of the vector. This means fetched items cannot stay in the L1/L2 cache for long: previous values are quickly evicted in order to read new ones, and because of the shuffle no index value is processed twice. The same cache line can be touched more than once by the same thread, but a given cache line is likely evicted long before it is read again. This causes threads to load a cache line from the L3 cache for each value. The L3 cache is designed to scale, but this code puts serious pressure on it (throughput).

    There is another big issue: the L3 slices are 2 MiB per core and the L3 cache is not perfect, so there are some cache misses causing the DRAM to be used. This is also why your program starts to be significantly slower with much more than 8 threads: the amount of space required for all the vectors is clearly too big for the L3, and most of the reads/stores need to be done directly in DRAM, with a much bigger latency. For example, with 11 threads you need 11*1.52 = 16.72 MiB, which is more than your L3, which should be 16 MiB. With 16 threads, it is even 24.32 MiB. Last but not least, there is another, more subtle effect: while using more threads means fewer memory accesses per thread, the accesses are more spread out in memory, so the probability of reusing cache lines drops as the number of threads grows.
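
    (To make the arithmetic above easy to reproduce, here is a small sketch that prints the combined size of the external_sol vectors for each thread count and flags when it exceeds the L3 cache; the 16 MiB L3 size is taken from the i7-11700KF specification and should be adjusted for other processors.)

        #include <cstdio>

        int main() {
            const long long dimension = 50000LL * (3 + 1) * 2;          // 400,000 ints
            const double per_vec_mib = dimension * 4.0 / (1024 * 1024); // ~1.52 MiB each
            const double l3_mib = 16.0;                                 // assumed L3 size
            for (int np = 2; np <= 16; np++)
                printf("%2d threads: %5.2f MiB of solution vectors%s\n",
                       np, np * per_vec_mib,
                       np * per_vec_mib > l3_mib ? " (exceeds L3)" : "");
            return 0;
        }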

    I analysed the execution of this code on my i5-9600KF processor with a low-level profiler (VTune), and I can confirm that a significant part of the accesses are performed in DRAM with 5 threads, while this is clearly not the case with 1 thread. VTune also reports that a significant amount of time is spent on cache misses (mainly in the L2 on my processor, since it is half the size of yours).

    Note that the frequency of the cores decreases as more cores become active, so that the processor stays power efficient (as indicated by @teapot418). This is why your processor (like many recent processors) cannot execute an algorithm 8 times faster on 8 cores (even assuming the target algorithm is efficient sequentially).