Why did I failed to speed up my program by making it cache-friendly?

I used intel's ifort to compile my program

There are a lot nested loops like the one below

do i=1,imax
do j=1,jmax
if (a(i,j).ne.1) cycle
.....
.....
.....
enddo
enddo

as well as some loops like

do j=1,jmax
do i=1,imax
if (a(i,j).ne.1) cycle
.....
.....
.....
enddo
enddo

I noticed that Fortran stores arrays by column-major order, so I put j loop outside of i loop.

It confused me that, there seems a drop of performance.

In contrast, I tried to put i loop outsider of j loop for others loops. But there's still a performance loss, slight however

program p 
integer ::i, j, k, s
real ::a(1000,1000)

do i=1,1000
  do j=1,1000
    call random_number(a(i,j))
  enddo
enddo

do k=1,10000
  do j=1,1000
    do i=1,1000
      if (a(i,j) .ge. 111) s = s+a(i,j)
    enddo
  enddo
enddo

print *,s

end

ifort -fPIC -no-prec-div -O3 -xCORE-AVX2 a.f90 -o a time ./a 10921820

real    0m2.221s
user    0m2.221s
sys     0m0.001s

program p 
integer ::i, j, k, s
real ::a(1000,1000)

do i=1,1000
  do j=1,1000
    call random_number(a(i,j))
  enddo
enddo

do k=1,10000
  do i=1,1000
    do j=1,1000
      if (a(i,j) .ge. 111) s = s+a(i,j)
    enddo
  enddo
enddo

print *,s

end

ifort -fPIC -no-prec-div -O3 -xCORE-AVX2 a.f90 -o a

time ./a
    10923324

real    0m4.459s
user    0m4.457s
sys     0m0.003s

The performance difference is stable, which may prove the optimization. However when I adapted it to my real project, I failed.

Thank you again, I'm quite new to stack overflow, thanks for your help

Solution

As you mention correctly, Fortran is column-major. This means that the column is the major index and that all columns of the matrix are stored one after the other in memory. Probably you assumed, that column-major means that looping over the column index means traversing the matrix over elements that are contiguous in memory, but this is not true. The matrix

| a11, a12, a13 |
| a21, a22, a23 |
| a31, a32, a33 |

is stored in column-major order as

a11, a21, a31, a12, a22, a32, a13, a23, a33
--column 1---  --column 2---  --column 3---

This means that when you create a two-dimensional loop in Fortran, the first index, which represents the row index, should always be the innermost to get the best performance, because then you traverse the matrix in the memory order that I wrote above. Therefore

do j=1,1000
    do i=1,1000
        if (a(i,j) .ge. 111) s = s + a(i,j)
    enddo
enddo

gives, as you found as well, the best performance.