I am practicing multi-threaded programs as part of my Operating Systems course and wrote a very basic program calculating the sum from 1 to n. However, when I ran it on my Intel i5 1035G7 laptop, it ran faster than my desktop Ryzen 5 5600G which should be much stronger than my laptop. The serial version (just adding 1 to n without multi-threading) does run faster on my PC, so I am really confused why the multi-threaded one does not. Can someone explain why? Any help would be much appreciated.
Here is the basic structure of my program:
struct threadData
{
unsigned long long *tempResult;
long long start;
long long end;
};
void *calculateTotal(void *arg)
{
struct threadData *structPtr = (struct threadData *)arg;
for (long long i = structPtr->start; i <= structPtr->end; i++)
{
*(structPtr->tempResult) += i;
}
pthread_exit(0);
}
int main(int argc, char *argv[])
{
...
int numThreads = atoi(argv[1]);
long long input = atoll(argv[2]);
struct threadData threadDataArray[numThreads];
unsigned long long result = 0;
unsigned long long threadResult[numThreads];
...
pthread_t threads[numThreads];
for (int i = 0; i < numThreads; i++)
{
threadDataArray[i].start = i * input / (numThreads) + 1;
threadDataArray[i].end = (i + 1) * input / (numThreads);
threadDataArray[i].tempResult = &threadResult[i];
int threadCreate = pthread_create(&threads[i], NULL, calculateTotal, (void *)&threadDataArray[i]);
...
}
// wait for all threads to join
for (int i = 0; i < numThreads; i++)
{
int threadJoin = pthread_join(threads[i], NULL);
...
}
// calculate total
for (int i = 0; i < numThreads; i++)
{
result += threadResult[i];
}
printf("(running in parallel)\nsum is: %llu\n", result);
}
The result from my laptop:
$ time ./ex2parallel 10 4897582469
(running in parallel)
sum is: 11993157022776859215
real 0m9.955s
user 1m0.628s
sys 0m0.010s
The result from my desktop:
$ time ./ex2parallel 10 4897582469
(running in parallel)
sum is: 11993157022776859215
real 0m31.003s
user 4m1.410s
sys 0m0.011s
I ran both programs on my laptop and desktop, and expected my Ryzen desktop to perform way better than my laptop.
False sharing. If you change these lines:
unsigned long long threadResult[numThreads];
threadDataArray[i].tempResult = &threadResult[i];
to be
unsigned long long threadResult[numThreads*8];
threadDataArray[i].tempResult = &threadResult[i*8];
And re-run your results, I expect that you will find both machines run it much faster, and likely your desktop will beat your laptop.
Your values, threadResults[n], are adjacent in memory, and being accessed simultaneously by 6 cpus on your laptop and 8 on your desktop. This causes a negative scaling -- the more cpus you add, the slower it gets. By moving them apart by 8*sizeof(long long), they are likely displaced on separate cache lines, or at least fewer cpus are hammering the same ones.