I used intel's ifort to compile my program
There are a lot nested loops like the one below
do i=1,imax
do j=1,jmax
if (a(i,j).ne.1) cycle
.....
.....
.....
enddo
enddo
as well as some loops like
do j=1,jmax
do i=1,imax
if (a(i,j).ne.1) cycle
.....
.....
.....
enddo
enddo
I noticed that Fortran stores arrays by column-major order, so I put j loop outside of i loop.
It confused me that, there seems a drop of performance.
In contrast, I tried to put i loop outsider of j loop for others loops. But there's still a performance loss, slight however
program p
integer ::i, j, k, s
real ::a(1000,1000)
do i=1,1000
do j=1,1000
call random_number(a(i,j))
enddo
enddo
do k=1,10000
do j=1,1000
do i=1,1000
if (a(i,j) .ge. 111) s = s+a(i,j)
enddo
enddo
enddo
print *,s
end
ifort -fPIC -no-prec-div -O3 -xCORE-AVX2 a.f90 -o a time ./a 10921820
real 0m2.221s
user 0m2.221s
sys 0m0.001s
program p
integer ::i, j, k, s
real ::a(1000,1000)
do i=1,1000
do j=1,1000
call random_number(a(i,j))
enddo
enddo
do k=1,10000
do i=1,1000
do j=1,1000
if (a(i,j) .ge. 111) s = s+a(i,j)
enddo
enddo
enddo
print *,s
end
ifort -fPIC -no-prec-div -O3 -xCORE-AVX2 a.f90 -o a
time ./a
10923324
real 0m4.459s
user 0m4.457s
sys 0m0.003s
The performance difference is stable, which may prove the optimization. However when I adapted it to my real project, I failed.
Thank you again, I'm quite new to stack overflow, thanks for your help
As you mention correctly, Fortran is column-major. This means that the column is the major index and that all columns of the matrix are stored one after the other in memory. Probably you assumed, that column-major means that looping over the column index means traversing the matrix over elements that are contiguous in memory, but this is not true. The matrix
| a11, a12, a13 |
| a21, a22, a23 |
| a31, a32, a33 |
is stored in column-major order as
a11, a21, a31, a12, a22, a32, a13, a23, a33
--column 1--- --column 2--- --column 3---
This means that when you create a two-dimensional loop in Fortran, the first index, which represents the row index, should always be the innermost to get the best performance, because then you traverse the matrix in the memory order that I wrote above. Therefore
do j=1,1000
do i=1,1000
if (a(i,j) .ge. 111) s = s + a(i,j)
enddo
enddo
gives, as you found as well, the best performance.