I compile a Fortran 90 code with the mpif90 compiler using two different makefiles. The first one looks like:
FC = mpif90
FFLAGS = -Wall -ffree-line-length-none
FOPT = -O3
all: ParP2S.o ParP2S
ParP2S.o: ParP2S.f90
	$(FC) $(FFLAGS) $(FOPT) ParP2S.f90 -c
ParP2S: ParP2S.o
	$(FC) $(FFLAGS) $(FOPT) ParP2S.o -o ParP2S
clean:
	rm -f *.o*
The second makefile looks very similar; I just added the -fopenmp flag:
FC = mpif90
FFLAGS = -Wall -ffree-line-length-none -fopenmp
FOPT = -O3
all: ParP2S.o ParP2S
ParP2S.o: ParP2S.f90
	$(FC) $(FFLAGS) $(FOPT) ParP2S.f90 -c
ParP2S: ParP2S.o
	$(FC) $(FFLAGS) $(FOPT) ParP2S.o -o ParP2S
clean:
	rm -f *.o*
The second makefile is for a hybrid (MPI with OpenMP) version of the code. For now, I have exactly the same code, just compiled with these two different makefiles. In the second case, the code is more than 100 times slower. Any comments on what I am doing wrong?
edit 1: I am not running multi-threaded tasks. In fact, the code does not have any OpenMP directives; it is just the pure MPI code compiled with a different makefile. Nevertheless, I did try running after the following commands (see below) and it didn't help.
export MV2_ENABLE_AFFINITY=0
export OMP_NUM_THREADS=1
export OMP_PROC_BIND=true
mpirun -np 2 ./ParP2S
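A minimal check (not part of ParP2S) that can be compiled with the second makefile to confirm how many OpenMP threads each rank would actually use; omp_get_max_threads is the standard OpenMP runtime call:
program check_threads
  use mpi
  use omp_lib
  implicit none
  integer :: ierr, rank
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  ! each rank reports how many OpenMP threads the runtime would use
  print *, 'rank', rank, 'omp_get_max_threads() =', omp_get_max_threads()
  call MPI_Finalize(ierr)
end program check_threads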
edit 2: I am using gcc version 4.9.2 (I know there was a bug with vectorization with -fopenmp in an older version). I thought the inclusion of the -fopenmp flag could be inhibiting the compiler optimizations; however, after reading the interesting discussion (May compiler optimizations be inhibited by multi-threading?), I am not sure if this is the case. Furthermore, as my code does not have any OpenMP directives, I don't see why the code compiled with -fopenmp should be that much slower.
edit 3: When I run without -fopenmp (first makefile), it takes about 0.2 seconds without optimizations (-O0) and 0.08 seconds with optimizations (-O3), but including the flag -fopenmp it takes about 11 seconds with either -O3 or -O0.
It turned out that the problem was really task affinity, as suggested by Vladimir F and Gilles Gouaillardet (thank you very much!).
First, I realized I was using OpenMPI version 1.6.4 and not MVAPICH2, so the command export MV2_ENABLE_AFFINITY=0
has no real meaning here. Second, I was (presumably) taking care of the affinity of different OpenMP threads by setting
export OMP_PROC_BIND=true
export OMP_PLACES=cores
but I was not setting the correct bindings for the MPI processes, as I was incorrectly launching the application as
mpirun -np 2 ./ParP2S
and it seems that, with OpenMPI version 1.6.4, a more appropriate way to do it is
mpirun -np 2 -bind-to-core -bycore -cpus-per-proc 2 ./hParP2S
The options -bind-to-core -bycore -cpus-per-proc 2
ensure that each MPI process is bound to its own cores (see https://www.open-mpi.org/doc/v1.6/man1/mpirun.1.php and also Gilles Gouaillardet's comments on Ensure hybrid MPI / OpenMP runs each OpenMP thread on a different core). Without them, both MPI processes were going to one single core, which was the reason for the poor efficiency of the code when the flag -fopenmp
was used in the Makefile.
Apparently, when running pure MPI code compiled without the -fopenmp
flag, different MPI processes go automatically to different cores, but with -fopenmp
one needs to specify the bindings manually as described above.
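To verify what the launcher actually does, OpenMPI's mpirun also accepts the --report-bindings option, which prints each rank's binding at launch; for example:
mpirun -np 2 -bind-to-core -bycore -cpus-per-proc 2 --report-bindings ./hParP2S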
As a matter of completeness, I should mention that there is no standard for setting the correct task affinity, so my solution will not work on e.g. MVAPICH2 or (possibly) different versions of OpenMPI. Furthermore, running nproc MPI processes with nthreads threads each on ncores cores would require e.g.
export OMP_PROC_BIND=true
export OMP_PLACES=cores
export OMP_NUM_THREADS=nthreads
mpirun -np nproc -bind-to-core -bycore -cpus-per-proc nthreads ./hParP2S
where ncores = nproc*nthreads is the total number of cores used (i.e. each MPI process is given nthreads cores).
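For instance, for nproc=2 MPI processes with nthreads=2 OpenMP threads each (so ncores=4), this would be
export OMP_PROC_BIND=true
export OMP_PLACES=cores
export OMP_NUM_THREADS=2
mpirun -np 2 -bind-to-core -bycore -cpus-per-proc 2 ./hParP2S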
PS: my code has an MPI_Alltoall call. The condition where more than one MPI process sits on one single core (no hyperthreading) while calling this subroutine should be the reason why the code was about 100 times slower.
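For reference, a minimal self-contained sketch (not the actual ParP2S code) that times repeated MPI_Alltoall calls; it can be launched with and without the binding options above to compare the two cases:
program alltoall_bench
  use mpi
  implicit none
  integer, parameter :: n = 1024     ! elements sent to each rank (arbitrary choice)
  integer, parameter :: nrep = 100   ! repetitions, so the time is measurable
  integer :: ierr, rank, nprocs, i
  double precision :: t0, t1
  double precision, allocatable :: sendbuf(:), recvbuf(:)
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  allocate(sendbuf(n*nprocs), recvbuf(n*nprocs))
  sendbuf = dble(rank)
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  t0 = MPI_Wtime()
  do i = 1, nrep
     call MPI_Alltoall(sendbuf, n, MPI_DOUBLE_PRECISION, &
                       recvbuf, n, MPI_DOUBLE_PRECISION, &
                       MPI_COMM_WORLD, ierr)
  end do
  t1 = MPI_Wtime()
  if (rank == 0) print *, 'time for', nrep, 'MPI_Alltoall calls (s):', t1 - t0
  deallocate(sendbuf, recvbuf)
  call MPI_Finalize(ierr)
end program alltoall_bench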