Search code examples
cparallel-processingoperating-system

Attempt to parallelize the summation of natural numbers from 1 to N happens to lower computing speed instead?


Read in Silberschatz’s Operating System Concepts about the use of multithreading in order to perform “the summation of a non-negative integer in a separate thread using the well-known summation function”:

summation

In my experience, not only does adding more threads increase the time required to compute the summation, using single-threading would actually be faster than multithreading.

Now I’m left wondering if something has gone wrong with my implementation, or if multithreading is just not the right thing to use here.

void *calculateSum(void *arg)
{
    struct ThreadData *data = (struct ThreadData *)arg;
    data->sum = 0;
    for (int i = data->start; i <= data->end; ++i)
    {
        data->sum += i;
    }
    return NULL;
}

I'm using clock() to measure the time interval.


Solution

  • Clock measure CPU time. CPU time is not supposed to decrease with number of threads

    You are using clock to compute time.

    clock is not giving you the ellapsed time, but the CPU time.

    So, in an ideal world, using multiple thread keeps that exactly constant. If you are using 1 thread, it needs 10 seconds of CPU time, that are consumed in 10 seconds, assuming your PC has nothing else to do, and that all the computation is done with nothing blocking (nothing to wait while spending no CPU time).

    If you are using 10 threads, then, in an ideal world, it needs still 10 seconds of CPU time, but all 10 threads cooperate to consume that in only 1 second (that is 1 second of CPU time each, but in parallel).

    Now, the world is not ideal. The simple fact to create a multi-thread enabled version, will maybe lead to an algorithm that needs 12 seconds to compute the same thing. Plus, thread do not parallelize exactly (because, for example, of what was mentioned: there are cache fault for some data).

    So, in a non-ideal world, maybe now what was taking 10 seconds in mono-thread version, will take 12/(0.98*nth)+0.3 seconds (real time) with nth threads.

    (I totally made up this formula. The point is: it is more than 12/nth, and therefore more than nth/10.

    Still, that is still faster, not slower than monothread version (at least it is with your code; I've tried).

    CPU time is not faster. Obviously. It is even more. But that is normal. In no world, even an ideal one, a task is easier (takes less CPU time) because you implemented it with threads.

    So, long story short: don't use clock to measure how you parallelized things. That measure the cpu time needed to compute. Not how fast that was.

    Cache miss

    Once that corrected, you still have a pretty bad parallelization. It least, it doesn't become slower with number of threads. But it doesn't become much faster neither.

    That is because of this:

        for (int i = data->start; i <= data->end; ++i)
        {
            data->sum += i;
        }
    

    Don't use so extensively that memory. Sure, there is no conflict on the data (data->sum is unique to that thread. No seed for a semaphore here, no risk of corruption). But data->sum is part of a contiguous small array of ThreadData.

    So, they all (the data->sum of all threads) probably using the same cache-line. So you end up transferring memory endlessly when it is not needed.

    Use a local variable instead

    void *calculateSum(void *arg)
    {
        struct ThreadData *data = (struct ThreadData *)arg;
        long long localSum = 0;
        for (int i = data->start; i <= data->end; ++i)
        {
            localSum += i;
        }
        data->sum=localSum;
        return NULL;
    }
    

    And there you have probably what you were expected: if you double the number of threads, you almost half the elapsed time (measured by gettimeofday), but of course, still, not the cpu time (measured by clock)

    (localEnd is probably not needed. But well, in C, a for loop is more of what is called while in most language. The end condition is evaluated each time, so data->end is