Vector calculations become slower with a higher optimization level and OpenMP

Consider the following Fortran code:

program example
    implicit none
    integer, parameter  ::  ik = selected_int_kind(15)
    integer, parameter  ::  rk = selected_real_kind(15,307)

    integer(ik)         :: N, i, j, pc, time_rate, start_time, end_time, M

    real(rk), allocatable:: K(:,:), desc(:,:)
    real(rk)                :: kij, dij

    integer             :: omp_get_num_threads, nth
    N = 2000
    M = 400

    allocate(K(N,N))
    allocate(desc(N,M))

    pc=10
    do i = 1, N
        desc(i,:) = real(i,rk)
        if (i==int(N*pc)/100) then
            print * ,"desc % complete: ",pc
            pc=pc+10
        endif
    enddo
    call system_clock(start_time)
    !$OMP PARALLEL PRIVATE(nth)
    nth = omp_get_num_threads()
    print *,"omp threads", nth
    !$OMP END PARALLEL

    !$OMP PARALLEL DO &
    !$OMP DEFAULT(SHARED) &
    !$OMP PRIVATE(i,j,dij,kij)
    do i = 1, N
        do j = i, N
            dij = sum(abs(desc(i,:) - desc(j,:)))
            kij = dexp(-dij)
            K(i,j) = kij
            K(j,i) = kij
        enddo
        K(i,i) = K(i,i) + 0.1_rk
    enddo
    !$OMP END PARALLEL DO

    call system_clock(end_time, time_rate)
    print* , "Time taken for Matrix:", real(end_time - start_time, rk)/real(time_rate, rk)

end program example

I compiled it with gfortran-6 on Mac OS X 10.11 using the following flags:

gfortran example.f90 -fopenmp -O0
gfortran example.f90 -fopenmp -O3
gfortran example.f90 -fopenmp -mtune=native

I then ran it with one and two threads by setting the OMP_NUM_THREADS variable. I can see that it uses two cores. However, the -O3 flag, which should enable vectorization, does not improve performance at all; if anything, it degrades it slightly. Timings in seconds, averaged over 10 runs, are given below:

| Flags          | 1 thread (s) | 2 threads (s) |
|----------------|--------------|---------------|
| -O0            | 10.962       | 9.183         |
| -O3            | 11.581       | 9.250         |
| -mtune=native  | 11.211       | 9.084         |

What is wrong with my program?

Solution

First of all, if you want -O3 to improve performance, you have to give the compiler something it can actually optimise. The bulk of the work happens inside the sum intrinsic, which operates on an array (vectorised) expression. That part does not get any faster when you switch from -O0 to -O3.

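Also, if you want better performance, transpose desc: desc(i,:) is non-contiguous in memory, whereas desc(:,i) is contiguous. Fortran stores arrays in column-major order, so the first index varies fastest. With desc declared as desc(N,M), consecutive elements of desc(i,:) are N elements apart in memory.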
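
As a rough sketch of the transposition suggested above, the program below stores desc as desc(M,N), so that each descriptor vector is a contiguous column. The program name transposed_desc is made up for illustration, and the timing uses system_clock as in the question. The loop structure and the OpenMP directive are otherwise unchanged. No timings are claimed here; how much this helps will depend on the compiler and the machine.

```fortran
program transposed_desc
    ! Sketch only: same kernel as in the question, but with desc stored
    ! as (M,N) so that desc(:,i) is a contiguous column.
    use iso_fortran_env, only: int64
    implicit none
    integer, parameter :: rk = selected_real_kind(15, 307)
    integer, parameter :: N = 2000, M = 400

    real(rk), allocatable :: K(:,:), desc(:,:)
    real(rk)              :: kij
    integer               :: i, j
    integer(int64)        :: start_time, end_time, time_rate

    allocate(K(N,N))
    allocate(desc(M,N))          ! transposed with respect to the question

    ! Fill each descriptor (now a column) with the same test values.
    do i = 1, N
        desc(:,i) = real(i, rk)
    end do

    call system_clock(start_time, time_rate)

    !$omp parallel do default(shared) private(i, j, kij)
    do i = 1, N
        do j = i, N
            ! Both slices are contiguous, so the sum walks memory
            ! with unit stride.
            kij = exp(-sum(abs(desc(:,i) - desc(:,j))))
            K(i,j) = kij
            K(j,i) = kij
        end do
        K(i,i) = K(i,i) + 0.1_rk
    end do
    !$omp end parallel do

    call system_clock(end_time)
    print *, "Time taken for Matrix:", &
        real(end_time - start_time, rk) / real(time_rate, rk)
end program transposed_desc
```

It can be built the same way as the original, for example with gfortran example.f90 -fopenmp -O3, and run with OMP_NUM_THREADS set to 1 and 2 to compare against the timings in the question.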