Multithreaded program calculating sum running faster on laptop than desktop PC

I am practicing multi-threaded programs as part of my Operating Systems course and wrote a very basic program calculating the sum from 1 to n. However, when I ran it on my Intel i5 1035G7 laptop, it ran faster than my desktop Ryzen 5 5600G which should be much stronger than my laptop. The serial version (just adding 1 to n without multi-threading) does run faster on my PC, so I am really confused why the multi-threaded one does not. Can someone explain why? Any help would be much appreciated.

Here is the basic structure of my program:


struct threadData
{
    unsigned long long *tempResult;
    long long start;
    long long end;
};

void *calculateTotal(void *arg)
{
    struct threadData *structPtr = (struct threadData *)arg;
    for (long long i = structPtr->start; i <= structPtr->end; i++)
    {
        *(structPtr->tempResult) += i;
    }
    pthread_exit(0);
}

int main(int argc, char *argv[])
{
    ...
    int numThreads = atoi(argv[1]);
    long long input = atoll(argv[2]);
    struct threadData threadDataArray[numThreads];
    unsigned long long result = 0;
    unsigned long long threadResult[numThreads];
    ...
    pthread_t threads[numThreads];

    for (int i = 0; i < numThreads; i++)
    {
        threadDataArray[i].start = i * input / (numThreads) + 1;
        threadDataArray[i].end = (i + 1) * input / (numThreads);
        threadDataArray[i].tempResult = &threadResult[i];
        int threadCreate = pthread_create(&threads[i], NULL, calculateTotal, (void *)&threadDataArray[i]);
        ...
    }
    // wait for all threads to join
    for (int i = 0; i < numThreads; i++)
    {
        int threadJoin = pthread_join(threads[i], NULL);
        ...
    }
    // calculate total
    for (int i = 0; i < numThreads; i++)
    {
        result += threadResult[i];
    }
    printf("(running in parallel)\nsum is: %llu\n", result);
}

The result from my laptop:

$ time ./ex2parallel 10 4897582469
(running in parallel)
sum is: 11993157022776859215

real    0m9.955s
user    1m0.628s
sys     0m0.010s

The result from my desktop:

$ time ./ex2parallel 10 4897582469
(running in parallel)
sum is: 11993157022776859215

real    0m31.003s
user    4m1.410s
sys     0m0.011s

I ran both programs on my laptop and desktop, and expected my Ryzen desktop to perform way better than my laptop.

Solution

False sharing. If you change these lines:


    unsigned long long threadResult[numThreads];

        threadDataArray[i].tempResult = &threadResult[i];

to be


    unsigned long long threadResult[numThreads*8];

        threadDataArray[i].tempResult = &threadResult[i*8];

And re-run your results, I expect that you will find both machines run it much faster, and likely your desktop will beat your laptop.

Your values, threadResults[n], are adjacent in memory, and being accessed simultaneously by 6 cpus on your laptop and 8 on your desktop. This causes a negative scaling -- the more cpus you add, the slower it gets. By moving them apart by 8*sizeof(long long), they are likely displaced on separate cache lines, or at least fewer cpus are hammering the same ones.