performance, parallel-processing, openmp, performance-testing

Why does a 'for loop' inside a 'parallel for loop' take longer than the same 'for loop' in a serial region?


I am testing the performance of a cluster on which I am using 64 threads. I have written a simple test program:

    #include <iostream>
    #include <vector>
    #include <omp.h>

    using std::cout;
    using std::endl;

    int main(){
        unsigned int m(67000);
        double start_time_i(0.),end_time_i(0.),start_time_init(0.),end_time_init(0.);

        cout<<"omp_get_max_threads : "<<omp_get_max_threads()<<endl;
        cout<<"omp_get_num_procs : "<<omp_get_num_procs()<<endl;
        omp_set_num_threads(omp_get_max_threads());
        unsigned int dim_i=omp_get_max_threads();
        unsigned int dim_j=dim_i*m;

        // dim_i rows of dim_j elements each, zero-initialized
        std::vector<std::vector<unsigned int>> vector;
        vector.resize(dim_i, std::vector<unsigned int>(dim_j, 0));

        // Serial reference: fill a single row outside any parallel region
        start_time_init = omp_get_wtime();
        for (unsigned int j=0;j<dim_j;j++){
                        vector[0][j]=j;
        }
        end_time_init = omp_get_wtime();

        // Parallel version: each iteration of the outer loop fills one row
        start_time_i = omp_get_wtime();
        #pragma omp parallel for
        for (unsigned int i=0;i<dim_i;i++){
                        // The timers are declared inside the loop so that each thread has
                        // its own copy; timers shared between threads would be a data race.
                        double start_time_j = omp_get_wtime();
                        for (unsigned int j=0;j<dim_j;j++) vector[i][j]=i+j;
                        double end_time_j = omp_get_wtime();
                        // Print one thread at a time so the output lines do not interleave
                        #pragma omp critical
                        cout<<"i "<<i<<" thread "<<omp_get_thread_num()<<" int_time = "<<(end_time_j-start_time_j)*1000<<endl;

        }
        end_time_i = omp_get_wtime();


        cout<<"time_final = "<<(end_time_i-start_time_i)*1000<<endl;
        cout<<"initial non parallel region "<< " time = "<<(end_time_init-start_time_init)*1000<<endl;

        return 0;
    }

I do not understand why "(end_time_j-start_time_j)*1000" is much larger (around 50 ms) than the time needed to run the same loop over j outside the parallel region, i.e. "(end_time_init-start_time_init)*1000" (around 1 ms). omp_get_max_threads() and omp_get_num_procs() both return 64.


Solution

  • Your loop just fills memory with a lot of values. This task is not computationally expensive; its speed is limited by how fast memory can be written. One thread can write at a certain rate, but when N threads write simultaneously, the total memory bandwidth stays the same on Shared-Memory Multicore systems (i.e. most PCs and laptops, where all cores share the same path to memory) and increases on Distributed-Memory Multicore systems (high-end servers, where each processor socket has its own memory channels).

    So, depending on the system, the memory write speed per thread either stays the same or decreases when several such loops run concurrently. To me, a 50-fold difference seems a bit large. I got the following results on Compiler Explorer (which suggests that it is a Distributed-Memory Multicore system):

    omp_get_max_threads : 4
    omp_get_num_procs : 2
    i 2 thread 2 int_time = 0.095537
    i 0 thread 0 int_time = 0.084061
    i 1 thread 1 int_time = 0.099578
    i 3 thread 3 int_time = 0.10519
    time_final = 0.868523
    initial non parallel region  time = 0.090862
    

    On my laptop I got the following (so it is a shared-memory multicore system):

    omp_get_max_threads : 8
    omp_get_num_procs : 8
    i 7 thread 7 int_time = 0.7518
    i 5 thread 5 int_time = 1.0555
    i 1 thread 1 int_time = 1.2755
    i 6 thread 6 int_time = 1.3093
    i 2 thread 2 int_time = 1.3093
    i 3 thread 3 int_time = 1.3093
    i 4 thread 4 int_time = 1.3093
    i 0 thread 0 int_time = 1.3093
    time_final = 1.915
    initial non parallel region  time = 0.1578
    

    In conclusion, it depends on the system you are using.
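One way to check the memory-bandwidth explanation on a given machine is to repeat the row fill with a growing number of threads and watch how the per-row time and the aggregate write rate change. The following is only a minimal sketch under some assumptions: the function names `aggregate_bandwidth_gbs` and `slowest_fill_seconds` are made up for this illustration, timing uses `std::chrono` instead of `omp_get_wtime()`, and the code must be compiled with OpenMP enabled (for example `-fopenmp` with GCC or Clang), otherwise the pragma is ignored and the rows are filled one after another.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: aggregate write rate in GB/s when `threads` threads
// each write `bytes_per_thread` bytes and the slowest one takes `seconds`.
double aggregate_bandwidth_gbs(std::size_t bytes_per_thread, int threads, double seconds) {
    return static_cast<double>(bytes_per_thread) * threads / seconds / 1e9;
}

// Hypothetical helper: fill one row per thread, using `threads` threads, and
// return the slowest per-row fill time in seconds (-1.0 if the sanity check fails).
// Assumes threads > 0 and row_len > 0.
double slowest_fill_seconds(int threads, std::size_t row_len) {
    using timer = std::chrono::steady_clock;
    std::vector<std::vector<unsigned int>> rows(
        threads, std::vector<unsigned int>(row_len, 0));
    std::vector<double> elapsed(threads, 0.0);

    // One row per iteration; schedule(static, 1) gives each thread one row.
    #pragma omp parallel for num_threads(threads) schedule(static, 1)
    for (int i = 0; i < threads; ++i) {
        // Timers are local to the iteration, so threads do not share them.
        const auto start = timer::now();
        unsigned int* row = rows[i].data();
        for (std::size_t j = 0; j < row_len; ++j) {
            row[j] = static_cast<unsigned int>(i + j);
        }
        const auto end = timer::now();
        elapsed[i] = std::chrono::duration<double>(end - start).count();
    }

    // Read one value back so the stores are observably used.
    const unsigned int expected =
        static_cast<unsigned int>((threads - 1) + (row_len - 1));
    if (rows[threads - 1][row_len - 1] != expected) return -1.0;

    double slowest = 0.0;
    for (double t : elapsed) {
        if (t > slowest) slowest = t;
    }
    return slowest;
}
```

Calling `slowest_fill_seconds(n, 4288000)` for n = 1, 2, 4, ... and passing the result to `aggregate_bandwidth_gbs(4288000 * sizeof(unsigned int), n, t)` shows whether the aggregate rate keeps growing with the thread count or levels off. As a rough check with the approximate figures from the question (rows of 64 × 67000 = 4,288,000 values, about 17 MB assuming a 4-byte unsigned int, filled in about 1 ms serially and about 50 ms per thread with 64 threads), the helper gives about 17 GB/s for the serial loop and about 22 GB/s for the parallel one. If those timings are representative, the total write rate barely changed, which would be consistent with the memory system already being saturated. This estimate ignores cache effects.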