cmultithreadingperformancepthreadsopenmp

Lower than expected speedup when using multithreading


Remark: I feel a little bit stupid about this, but this might help someone

So, I am trying to improve the performance of a program by using parallelism. However, I am encountering an issue with the measured speedup. I have 4 CPUs:

~% lscpu
...
CPU(s):                4
...

However, the speedup is much lower than fourfold. Here is a minimal working example, with a sequential version, a version using OpenMP and a version using POSIX threads (to be sure it is not due to either implementation).

Purely sequential (add_seq.c):

#include <stddef.h>

int main() {
    for (size_t i = 0; i < (1ull<<36); i += 1) {
        __asm__("add $0x42, %%eax" : : : "eax");
    }
    return 0;
}

OpenMP (add_omp.c):

#include <stddef.h>

int main() {
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < (1ull<<36); i += 1) {
        __asm__("add $0x42, %%eax" : : : "eax");
    }
    return 0;
}

POSIX threads (add_pthread.c):

#include <pthread.h>
#include <stddef.h>

void* f(void* x) {
    (void) x;
    const size_t count = (1ull<<36) / 4;
    for (size_t i = 0; i < count; i += 1) {
        __asm__("add $0x42, %%eax" : : : "eax");
    }
    return NULL;
}
int main() {
    pthread_t t[4];
    for (size_t i = 0; i < 4; i += 1) {
        pthread_create(&t[i], NULL, f, NULL);
    }
    for (size_t i = 0; i < 4; i += 1) {
        pthread_join(t[i], NULL);
    }
    return 0;
}

Makefile:

CFLAGS := -O3 -fopenmp
LDFLAGS := -O3 -lpthread  # just to be sure

all: add_seq add_omp add_pthread

So, now, running this (using zsh's time builtin):

% make -B && time ./add_seq && time ./add_omp && time ./add_pthread
cc -O3 -fopenmp  -O3 -lpthread    add_seq.c   -o add_seq
cc -O3 -fopenmp  -O3 -lpthread    add_omp.c   -o add_omp
cc -O3 -fopenmp  -O3 -lpthread    add_pthread.c   -o add_pthread
./add_seq  24.49s user 0.00s system 99% cpu 24.494 total
./add_omp  52.97s user 0.00s system 398% cpu 13.279 total
./add_pthread  52.92s user 0.00s system 398% cpu 13.266 total

Checking CPU frequency, sequential code has maximum CPU frequency of 2.90 GHz, and parallel code (all versions) has uniform CPU frequency of 2.60 GHz. So counting billions of instructions:

>>> 24.494 * 2.9
71.0326
>>> 13.279 * 2.6
34.5254
>>> 13.266 * 2.6
34.4916

So, all in all, threaded code is only running twice as fast as sequential code, although it is using four times as much CPU time. Why is it so?

Remark: assembly for asm_omp.c seemed less efficient, since it did the for-loop by incrementing a register, and comparing it to the number of iterations, rather than decrementing and directly checking for ZF; however, this had no effect on performance


Solution

  • Well, the answer is quite simple: there are really only two CPU cores:

    % lscpu
    ...
    Thread(s) per core:    2
    Core(s) per socket:    2
    Socket(s):             1
    ...
    

    So, although htop shows four CPUs, two are virtual and only there because of hyperthreading. Since the core idea of hyper-threading is of sharing resources of a single core in two processes, it does not make similar code faster (it is only useful when running two threads using different resources).

    So, in the end, what happens is that time/clock() measures the usage of each logical core as that of the underlying physical core. Since all report ~100% usage, we get a ~400% usage, although it only represents a twofold speedup.

    Up until then, I was convinced this computer contained 4 physical cores, and had completely forgotten to check about hyperthreading.