I have an N-core processor (4 in my case). Why aren't N totally independent function calls on N threads roughly N times faster (of course there is the overhead of creating threads, but read further)?
Look at the following code:
#include <boost/multiprecision/cpp_int.hpp>
#include <chrono>
#include <future>
#include <iostream>
using namespace std;

namespace ch = std::chrono;
namespace mp = boost::multiprecision;
constexpr static unsigned long long int num = 3555;
// mp_factorial uses boost/multiprecision/cpp_int, so I get legit results
ch::steady_clock::time_point s1 = ch::steady_clock::now();
auto fu1 = std::async(std::launch::async, mp_factorial, num);
auto fu2 = std::async(std::launch::async, mp_factorial, num);
auto fu3 = std::async(std::launch::async, mp_factorial, num);
auto fu4 = std::async(std::launch::async, mp_factorial, num);
fu1.get(); fu2.get(); fu3.get(); fu4.get();
ch::steady_clock::time_point e1 = ch::steady_clock::now();
ch::steady_clock::time_point s2 = ch::steady_clock::now();
mp_factorial(num);
mp_factorial(num);
mp_factorial(num);
mp_factorial(num);
ch::steady_clock::time_point e2 = ch::steady_clock::now();
auto t1 = ch::duration_cast<ch::microseconds>(e1 - s1).count();
auto t2 = ch::duration_cast<ch::microseconds>(e2 - s2).count();
cout << t1 << " " << t2 << endl;
I get results like:
11756 20317
That's roughly 2 times faster. I've also tried this with a huge number, num = 355555, and got really similar results:
177462588 346575062
Why is this the case? I'm perfectly aware of Amdahl's law, and that a multicore processor isn't always number_of_cores times faster, but when I have independent operations, I'd expect better results, at least something near number_of_cores times faster.
Update:
All four threads are running and keeping their cores busy as expected, so this is not the issue.
The problem here is that you certainly have some large, lumpy numbers, which do not fit in the L1 and L2 caches of your processor, meaning that the processor sits and twiddles its little ALU fingers whilst the memory controller is jumping all over the place trying to read a little bit of memory for each core.
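A quick back-of-envelope check on the operand sizes (my own arithmetic, not part of the original answer): the number of bits in n! is log2(n!) = lgamma(n + 1) / ln 2, so a few lines of C++ can estimate the footprint:

#include <cmath>
#include <cstdio>

int main() {
    // Size of n! in bits: log2(n!) = lgamma(n + 1) / ln(2).
    for (double n : {3555.0, 355555.0}) {
        double bits = std::lgamma(n + 1.0) / std::log(2.0);
        std::printf("%.0f! is about %.1f KiB\n", n, bits / 8.0 / 1024.0);
    }
}

For the big test, each final value alone is roughly 740 KiB, well past a typical per-core L2, and the multiplication temporaries plus four concurrent threads multiply that footprint further.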
When you run in ONE thread, that thread will, at least largely, work on only three different memory regions (a = b * c: reading from b and c, writing to a).
When you do 4 threads, you have four different a = b * c; computations, each with three different streams of data, leading to more thrashing of the caches, the memory controller, and the "open pages" [pages here are a DRAM term, nothing to do with MMU pages, but you may also find that TLB misses are a factor as well].
So you get better performance from running more threads, but not 4x: because of the large amount of data being consumed and produced by each thread, the memory interface is the bottleneck. Other than getting a machine with a more efficient memory interface [and that may not be so easy], there's not much you can do about this; just accept that for this particular case, memory is more of a limiting factor than the calculation.
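Here is a minimal sketch of that bottleneck (my own microbenchmark, not from the original answer; triad, the buffer size, and the thread counts are all invented for illustration): each thread streams through its own a = b * c over buffers far larger than any cache, so adding threads mostly just queues more requests at the same memory interface:

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// a[i] = b[i] * c[i]: almost no arithmetic per byte moved, so throughput
// is limited by how fast memory can stream the three arrays.
void triad(std::vector<double>& a, const std::vector<double>& b,
           const std::vector<double>& c) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] = b[i] * c[i];
}

int main() {
    constexpr std::size_t n = 1 << 22;  // 32 MiB per vector: far beyond L2
    for (unsigned num_threads : {1u, 4u}) {
        // Give every thread its own a, b and c, as in the factorial test.
        std::vector<std::vector<double>> a(num_threads, std::vector<double>(n, 0.0));
        std::vector<std::vector<double>> b(num_threads, std::vector<double>(n, 2.0));
        std::vector<std::vector<double>> c(num_threads, std::vector<double>(n, 3.0));
        auto s = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < num_threads; ++t)
            pool.emplace_back(triad, std::ref(a[t]), std::ref(b[t]), std::ref(c[t]));
        for (auto& th : pool) th.join();
        auto e = std::chrono::steady_clock::now();
        std::printf("%u thread(s): %lld us\n", num_threads, (long long)
            std::chrono::duration_cast<std::chrono::microseconds>(e - s).count());
    }
}

On a typical quad-core desktop the 4-thread run finishes faster than the 1-thread run, but nowhere near 4x, for exactly the reason described above.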
The ideal problems to solve with multithreading are those that require a lot of calculation but don't use very much memory. I have a simple prime-number calculator and one that calculates "weird numbers"; both give almost exactly Nx performance improvement when run on N cores [but if I were to use these for numbers that are many times larger than 64 bits, they would stop giving the same benefit].
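For contrast, here is a toy of that compute-heavy kind of workload (a sketch of my own, assuming trial-division primality; the actual prime and "weird number" programs mentioned above aren't shown):

#include <cstdio>
#include <future>
#include <vector>

// Trial-division primality test: lots of arithmetic, almost no memory traffic.
bool is_prime(unsigned long long n) {
    if (n < 2) return false;
    for (unsigned long long d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Count the primes in [lo, hi); the working set is a handful of registers.
unsigned long long count_primes(unsigned long long lo, unsigned long long hi) {
    unsigned long long count = 0;
    for (unsigned long long n = lo; n < hi; ++n) count += is_prime(n);
    return count;
}

int main() {
    constexpr unsigned long long limit = 2000000;
    constexpr unsigned tasks = 4;
    std::vector<std::future<unsigned long long>> futures;
    for (unsigned t = 0; t < tasks; ++t)
        futures.push_back(std::async(std::launch::async, count_primes,
                                     limit * t / tasks, limit * (t + 1) / tasks));
    unsigned long long total = 0;
    for (auto& f : futures) total += f.get();
    std::printf("%llu primes below %llu\n", total, limit);
}

Because each task's state fits in registers and cache, four std::async tasks come close to a 4x speedup here (the ranges are slightly unbalanced, since trial division costs more for larger n, but the contrast with the memory-bound case is clear).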
Edit: There's also the possibility of contention on some resource the threads share: the locks inside new and malloc and their freeing counterparts are plausible candidates, given how much allocation big-number arithmetic does. "False sharing" is another; the term is used when you have something like this:
// Some global array.
int array[MAX_THREADS];
...
// Inside some function that each thread runs to update its own entry:
int my_id = thread_id();
array[my_id]++;
Although each thread has its own array entry, the same cache line is bounced from one CPU to the other. I once had an SMP (before multi-core) dhrystone benchmark that ran at 0.7x the performance of one processor when running on 2 processors, because ONE of the commonly accessed data items was stored as an int array[MAX_THREADS]. This is of course a rather extreme example...
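The standard cure (a general technique, not something spelled out above) is to pad each per-thread slot out to a full cache line; a minimal sketch, assuming 64-byte lines:

constexpr int MAX_THREADS = 8;

// Pad each slot to its own cache line (64 bytes on most current CPUs),
// so no two threads' counters ever share a line.
struct alignas(64) PaddedCounter {
    int value = 0;
};

PaddedCounter array[MAX_THREADS];

// Each thread now bumps a counter that lives on its own cache line:
//   array[my_id].value++;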