Tags: c++, multithreading, optimization, openmp, intel

Finding the best number of threads with Intel OpenMP: only 1 thread gives better results than many threads


I use the following type of loop multiple times in my code:

#pragma omp parallel for schedule(dynamic, num_threads)
for(int i=0; i<F_matrix_A.size(); i++){
    for(int j=0; j<F_matrix_A.size(); j++){
        F_previous_T[i][j] = F_previous[j][i];
    }
}

#pragma omp parallel for schedule(dynamic, num_threads)
for(int i=0; i<F_matrix_A.size(); i++){
    for(int k=0; k<F_matrix_A.size(); k++){
        for(int j=0; j<=i; j++){
            if(F_previous_T[i][k] != 0 && F_previous[k][j] !=0){
                Fisher_new[i][j] += F_previous_T[i][k]*F_previous[k][j];
            }
        }
    }
}

I get the best performance when I set the parameter beforehand with: #define num_threads 1

I am working on a workstation with 64 cores (I see 128 processors when I do cat /proc/cpuinfo). I think it is a pity not to be able to benefit from this large number of processors.

Is it due to the specific pragma that I use, shown below?

#pragma omp parallel for schedule(dynamic, num_threads)


Are there other alternatives to get a lower runtime? I saw on different forums that using a fairly high number of threads could cause significant overhead.

My loops typically run over matrices of size 1700x1700.

If someone has an idea, I would be glad to hear it.

UPDATE 1: I have 2 versions of my code, one built with GNU g++ and the other with Intel icpc.

I am using the following "generic" Makefile:

ifneq "$(MAKECMDGOALS)" "clean"
include $(MAKECMDGOALS).make
endif

OBJECTS = $(SOURCES:.cpp=.o)

$(MAKECMDGOALS): $(SOURCES) $(EXECUTABLE)

$(EXECUTABLE): $(OBJECTS)
    $(CXX) $(LDFLAGS) $(OBJECTS) -o $@

.cpp.o:
    $(CXX) $(CXXFLAGS) $(LDFLAGS) $< -o $@

clean:
    rm -f *.o

1) For GNU g++, I compile with the gnu.make file:

CXX = g++ -std=c++11 -O3 -fopenmp
CXXFLAGS = -Wall -c
LDFLAGS = -march=native
SOURCES = main.cpp TSAF_gnu.cpp
EXECUTABLE = main_gnu.exe

2) For Intel icpc, I compile with the intel.make file:

CXX = icpc -std=c++11 -O3 -xHost -qopenmp
CXXFLAGS = -Wall -c -I${MKLROOT}/include
LDFLAGS  = -mkl=parallel
LDFLAGS += -L${MKLROOT}/lib/intel64_lin -Wl,-rpath,${MKLROOT}/lib/intel64_lin -lmkl_intel_lp64 -lmkl_intel_thread \
          -lmkl_core -liomp5 -lpthread
SOURCES = main.cpp TSAF_intel.cpp
EXECUTABLE = main_intel.exe

A standard run takes about 3 minutes.


Solution

  • The line schedule(dynamic, num_threads) is likely to cause scalability issues.

    Indeed, with a matrix of size 1700 and 64 threads, the chunk size of the dynamic schedule policy is 64 (the second parameter of the schedule clause). Thus, the number of chunks is floor(1700/64) = 26, which is far too few to feed 64 threads! Even with 32 threads, the work balancing is not very good. I think it is important to have at least 3-4 chunks per thread.

    Increasing the chunk size with the number of threads is odd. It is probably more relevant to set a granularity based on the input size. I advise using either schedule(guided) or schedule(dynamic,chunksize) with chunksize set to something like max(F_matrix_A.size() / (num_threads * 4), 1), as in the sketch after this paragraph (although schedule(dynamic,1) should not be so bad as long as you do not add collapse).
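    A minimal sketch of the size-based chunking, applied to the transposition loop (assuming <omp.h> and <algorithm> are included; the factor 4 is only a heuristic to tune):

    const int n = (int)F_matrix_A.size();
    const int nthreads = omp_get_max_threads();        // threads available to OpenMP
    const int chunk = std::max(n / (nthreads * 4), 1); // aim for ~4 chunks per thread

    #pragma omp parallel for schedule(dynamic, chunk)
    for(int i = 0; i < n; i++)
        for(int j = 0; j < n; j++)
            F_previous_T[i][j] = F_previous[j][i];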

    Alternatively, you can use the task and taskloop directives; a sketch follows.
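    For example, the triangular product loop could be written with taskloop like this (OpenMP 4.5+; the grainsize of 8 is an arbitrary assumption to be tuned, and n = F_matrix_A.size() as above):

    // One thread creates the tasks; every thread of the team executes them.
    #pragma omp parallel
    #pragma omp single
    #pragma omp taskloop grainsize(8) // each task gets at least 8 consecutive i values
    for(int i = 0; i < n; i++)
        for(int k = 0; k < n; k++)
            for(int j = 0; j <= i; j++)
                if(F_previous_T[i][k] != 0 && F_previous[k][j] != 0)
                    Fisher_new[i][j] += F_previous_T[i][k]*F_previous[k][j];

    Each task owns distinct values of i, so the writes to Fisher_new[i][j] do not race.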

    Also note that if you work on a machine with multiple NUMA nodes (this is probably the case since there are 64 cores), you should be very careful with dynamic scheduling, because threads may end up accessing remote NUMA memory nodes, which significantly decreases performance (this is clearly something you do not want in your memory-bound code). A first-touch initialization, sketched below, is a common mitigation.
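    A minimal sketch of such a first-touch initialization, under the assumption that threads are pinned (e.g. OMP_PROC_BIND=close and OMP_PLACES=cores) and that each row of the matrix is a contiguous allocation:

    // First touch: initialize with the same static schedule as the compute
    // loops, so each page is allocated on the NUMA node of the thread that
    // will later work on it.
    #pragma omp parallel for schedule(static)
    for(int i = 0; i < n; i++)
        for(int j = 0; j < n; j++)
            Fisher_new[i][j] = 0.0;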

    Update: you can work on the two ends of the i range simultaneously (pairing row i with row size-1-i) to significantly reduce the variability of the per-iteration computation time. The result would look like this:

    // Each iteration now handles both row i and row (size-1-i), so the
    // combined inner-loop work is the same for every i, and a static
    // schedule balances the load without dynamic-scheduling overhead.
    #pragma omp parallel for schedule(static)
    for(int i=0; i<(F_matrix_A.size()+1)/2; i++)
    {
        // Upper-part
        for(int k=0; k<F_matrix_A.size(); k++)
            for(int j=0; j<=i; j++)
                if(F_previous_T[i][k] != 0 && F_previous[k][j] != 0)
                    Fisher_new[i][j] += F_previous_T[i][k]*F_previous[k][j];
    
        // Lower-part (do not perform the middle twice)
        if(i < F_matrix_A.size()/2)
        {
            const int i2 = F_matrix_A.size() - 1 - i;
    
            for(int k=0; k<F_matrix_A.size(); k++)
                for(int j=0; j<=i2; j++)
                    if(F_previous_T[i2][k] != 0 && F_previous[k][j] != 0)
                        Fisher_new[i2][j] += F_previous_T[i2][k]*F_previous[k][j];
        }
    }