I use the following type of loops multiple times in my code:
#pragma omp parallel for schedule(dynamic, num_threads)
for(int i=0; i<F_matrix_A.size(); i++){
    for(int j=0; j<F_matrix_A.size(); j++){
        F_previous_T[i][j] = F_previous[j][i];
    }
}

#pragma omp parallel for schedule(dynamic, num_threads)
for(int i=0; i<F_matrix_A.size(); i++){
    for(int k=0; k<F_matrix_A.size(); k++){
        for(int j=0; j<=i; j++){
            if(F_previous_T[i][k] != 0 && F_previous[k][j] != 0){
                Fisher_new[i][j] += F_previous_T[i][k]*F_previous[k][j];
            }
        }
    }
}
I get the best performance when I set the parameter #define num_threads 1 beforehand.
I am working on a workstation with 64 cores (I see 128 processors when I do cat /proc/cpuinfo). I think it is a pity not to be able to benefit from this large number of processors.
Is it due to the specific pragma that I use, #pragma omp parallel for schedule(dynamic, num_threads)?
Are there other alternatives to get a lower runtime? I saw on different forums that using a rather high number of threads could cause significant overhead.
The size of my loops is typically 1700x1700.
If someone has an idea, I would be glad to hear it.
UPDATE 1: I have 2 versions of my code, one compiled with GNU g++ and the other with Intel icpc.

1) I am using the following "generic" Makefile:
ifneq "$(MAKECMDGOALS)" "clean"
include $(MAKECMDGOALS).make
endif

OBJECTS = $(SOURCES:.cpp=.o)

$(MAKECMDGOALS): $(SOURCES) $(EXECUTABLE)

$(EXECUTABLE): $(OBJECTS)
	$(CXX) $(LDFLAGS) $(OBJECTS) -o $@

.cpp.o:
	$(CXX) $(CXXFLAGS) $(LDFLAGS) $< -o $@

clean:
	rm -f *.o
2) For GNU g++, I compile with the gnu.make file:
CXX = g++ -std=c++11 -O3 -fopenmp
CXXFLAGS = -Wall -c
LDFLAGS = -march=native
LDFLAGS =
SOURCES = main.cpp TSAF_gnu.cpp
EXECUTABLE = main_gnu.exe
3) For Intel icpc, I compile with the intel.make file:
CXX = icpc -std=c++11 -O3 -xHost -qopenmp
CXXFLAGS = -Wall -c -I${MKLROOT}/include
LDFLAGS = -mkl=parallel
LDFLAGS += -L${MKLROOT}/lib/intel64_lin -Wl,-rpath,${MKLROOT}/lib/intel64_lin -lmkl_intel_lp64 -lmkl_intel_thread \
-lmkl_core -liomp5 -lpthread
SOURCES = main.cpp TSAF_intel.cpp
EXECUTABLE = main_intel.exe
A standard run takes about 3 minutes.
The line schedule(dynamic, num_threads) is likely to cause scalability issues.

Indeed, with a matrix of size 1700 and 64 threads, the chunk size of the dynamic scheduling policy is 64. Thus, the number of chunks is floor(1700/64) = 26, which is far too few to feed 64 threads!
Even with 32 threads, the load balancing is not very good. I think it is important to have at least 3-4 chunks per thread.
Increasing the chunk size with the number of threads is odd; it is probably more relevant to set the granularity based on the input size. I advise using either schedule(guided) or schedule(dynamic, chunksize) with chunksize set to something like max(F_matrix_A.size() / (num_threads * 4), 1) (although using schedule(dynamic, 1) should not be so bad as long as you do not add a collapse clause).
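A minimal sketch of the second option, assuming the matrices are stored as std::vector<std::vector<double>> and are square with the same size as F_matrix_A (adapt the types and the hypothetical helper name to your code); the chunk size is just the heuristic above, computed at runtime:

#include <omp.h>
#include <algorithm>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Same triangular product as in the question, but with a chunk size
// derived from the matrix size and the thread count (~4 chunks per thread).
void fisher_product(const Matrix &F_previous, const Matrix &F_previous_T,
                    Matrix &Fisher_new)
{
    const int n = static_cast<int>(F_previous.size());
    const int nthreads = omp_get_max_threads();
    const int chunksize = std::max(n / (nthreads * 4), 1);

    #pragma omp parallel for schedule(dynamic, chunksize)
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j <= i; j++)
                if (F_previous_T[i][k] != 0 && F_previous[k][j] != 0)
                    Fisher_new[i][j] += F_previous_T[i][k] * F_previous[k][j];
}

With n = 1700 and 64 threads this gives a chunk size of 6, i.e. roughly 280 chunks, which is plenty to keep all the threads busy.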
Alternatively, you can use the task and taskloop directives.
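A sketch of the taskloop variant (OpenMP 4.5 or later), assuming the same n and matrices as in the previous snippet; the grainsize value of 8 is only an illustrative choice:

// One thread creates the tasks, the whole team executes them;
// grainsize controls how many outer iterations go into each task.
#pragma omp parallel
#pragma omp single
#pragma omp taskloop grainsize(8)
for (int i = 0; i < n; i++)
    for (int k = 0; k < n; k++)
        for (int j = 0; j <= i; j++)
            if (F_previous_T[i][k] != 0 && F_previous[k][j] != 0)
                Fisher_new[i][j] += F_previous_T[i][k] * F_previous[k][j];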
Also note that if you work on a machine with multiple NUMA nodes (which is probably the case since it has 64 cores), you should be very careful with dynamic scheduling, because threads may access remote NUMA memory nodes, which significantly decreases performance (clearly something you do not want in a memory-bound code like this one).
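One common mitigation, assuming the matrices are (or can be) stored as contiguous rows (a vector-of-vectors gives little control over page placement), is to rely on the first-touch policy: initialise the data with the same static distribution the compute loop will later use, and bind the threads (e.g. OMP_PROC_BIND=close and OMP_PLACES=cores) so the mapping stays stable. This is only a sketch under those assumptions, and it mainly pays off when the compute loop itself is static, like the balanced version in the update below:

// First-touch sketch: the thread that will later work on row i touches it
// first, so its pages are allocated on that thread's local NUMA node.
// (Do the same when filling F_previous / F_previous_T.)
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        Fisher_new[i][j] = 0.0;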
Update: you can work on the two vertical sides of the triangular iteration space simultaneously to significantly reduce the variability of the inner-loop computation time. The result would look like this:
#pragma omp parallel for schedule(static)
for(int i=0; i<(F_matrix_A.size()+1)/2; i++)
{
    // Upper part
    for(int k=0; k<F_matrix_A.size(); k++)
        for(int j=0; j<=i; j++)
            if(F_previous_T[i][k] != 0 && F_previous[k][j] != 0)
                Fisher_new[i][j] += F_previous_T[i][k]*F_previous[k][j];

    // Lower part (do not perform the middle row twice)
    if(i < F_matrix_A.size()/2)
    {
        const int i2 = F_matrix_A.size() - 1 - i;
        for(int k=0; k<F_matrix_A.size(); k++)
            for(int j=0; j<=i2; j++)
                if(F_previous_T[i2][k] != 0 && F_previous[k][j] != 0)
                    Fisher_new[i2][j] += F_previous_T[i2][k]*F_previous[k][j];
    }
}
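The reason this balances so well: for a matrix of size n, pairing row i with row n-1-i means each outer iteration performs (i+1) + (n-i) = n+1 inner j-iterations per value of k, so every outer iteration costs roughly the same and a plain schedule(static) is sufficient. It also keeps each thread on a fixed set of rows, which fits the NUMA/first-touch remarks above.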