performance parallel-processing openmp performance-testing

Why does a 'for loop' inside a 'parallel for loop' take longer than the same 'for loop' in a serial region?

I am testing the performance of a cluster on which I use 64 threads. I wrote a simple test program:

        #include <iostream>
        #include <vector>
        #include <omp.h>
        using namespace std;

        int main(){
                unsigned int m(67000);
                double start_time_i(0.),end_time_i(0.),start_time_init(0.),end_time_init(0.);

                cout<<"omp_get_max_threads : "<<omp_get_max_threads()<<endl;
                cout<<"omp_get_num_procs : "<<omp_get_num_procs()<<endl;
                omp_set_num_threads(omp_get_max_threads());
                unsigned int dim_i=omp_get_max_threads();
                unsigned int dim_j=dim_i*m;

                // dim_i rows of dim_j elements each, zero-initialized
                std::vector<std::vector<unsigned int>> vector;
                vector.resize(dim_i, std::vector<unsigned int>(dim_j, 0));

                // serial reference: fill one row outside any parallel region
                start_time_init = omp_get_wtime();
                for (unsigned int j=0;j<dim_j;j++){
                        vector[0][j]=j;
                }
                end_time_init = omp_get_wtime();

                start_time_i = omp_get_wtime();
                #pragma omp parallel for
                for (unsigned int i=0;i<dim_i;i++){
                        // timers are declared inside the loop so that each thread has its own copy;
                        // timers shared between threads would be a data race
                        double start_time_j = omp_get_wtime();
                        for (unsigned int j=0;j<dim_j;j++) vector[i][j]=i+j;
                        double end_time_j = omp_get_wtime();
                        // print one thread at a time so the output lines do not interleave
                        #pragma omp critical
                        cout<<"i "<<i<<" thread "<<omp_get_thread_num()<<" int_time = "<<(end_time_j-start_time_j)*1000<<endl;
                }
                end_time_i = omp_get_wtime();

                cout<<"time_final = "<<(end_time_i-start_time_i)*1000<<endl;
                cout<<"initial non parallel region "<< " time = "<<(end_time_init-start_time_init)*1000<<endl;

                return 0;
        }

I do not understand why "(end_time_j-start_time_j)*1000" (around 50) is so much larger than the time needed to run the same loop over j outside the parallel region, i.e. "end_time_init-start_time_init" (around 1). Both omp_get_max_threads() and omp_get_num_procs() return 64.

Solution

Your loop just fills a block of memory with a long run of values. This task is not computationally expensive; its speed is limited by how fast memory can be written. One thread can write at a certain rate, but when N threads write simultaneously, the total memory bandwidth stays the same on Shared-Memory Multicore systems (i.e. most PCs and laptops), so each thread gets only a share of it, whereas it increases on Distributed-Memory Multicore systems (high-end servers). For more details please read this.

So, depending on the system, the per-thread memory write speed either stays the same or drops when several loops run concurrently. To me, a 50-fold difference seems rather large. I got the following results on Compiler Explorer (which suggests that it is a Distributed-Memory Multicore system):

omp_get_max_threads : 4
omp_get_num_procs : 2
i 2 thread 2 int_time = 0.095537
i 0 thread 0 int_time = 0.084061
i 1 thread 1 int_time = 0.099578
i 3 thread 3 int_time = 0.10519
time_final = 0.868523
initial non parallel region  time = 0.090862

On my laptop I got the following (so it is a shared-memory multicore system):

omp_get_max_threads : 8
omp_get_num_procs : 8
i 7 thread 7 int_time = 0.7518
i 5 thread 5 int_time = 1.0555
i 1 thread 1 int_time = 1.2755
i 6 thread 6 int_time = 1.3093
i 2 thread 2 int_time = 1.3093
i 3 thread 3 int_time = 1.3093
i 4 thread 4 int_time = 1.3093
i 0 thread 0 int_time = 1.3093
time_final = 1.915
initial non parallel region  time = 0.1578

In conclusion, it depends on the system you are using.