While benchmarking 'subtracting a vector from a matrix', I noticed that Fortran compilers appear to perform some sort of optimization when I reuse variables/code. It looks like the arrays are being served from cache memory, but I'm not sure. I believe this optimization is causing discrepancies in my benchmark results, and I would like to identify the specific type of optimization and, if possible, turn it off.
For example, the following code compares two cases and adds a Case 3 that is identical to Case 1. However, the time reported for Case 3 is much shorter than that for Case 1.
program main
   implicit none
   integer :: n = 1E7
   real*8, dimension(3) :: a
   real*8, allocatable, dimension(:, :) :: b, c
   real :: start, finish
   integer :: i

   allocate(b(n, 3))
   allocate(c(n, 3))

   call random_number(a)
   call random_number(b)

   ! Case 1: Do loop
   call cpu_time(start)
   do i = 1, 3
      c(:, i) = b(:, i) - a(i)
   enddo
   call cpu_time(finish)
   print*, 'do-loop : ', finish-start

   ! Case 2: Spread
   call cpu_time(start)
   c = b - spread(a, dim=1, ncopies=n)
   call cpu_time(finish)
   print*, 'spread : ', finish-start

   ! Case 3: Do loop (again)
   call cpu_time(start)
   do i = 1, 3
      c(:, i) = b(:, i) - a(i)
   enddo
   call cpu_time(finish)
   print*, 'do-loop : ', finish-start
end program main
This produces similar results with the Intel and GNU compilers, as shown below. I have tried investigating with flags like -O0 and -qopt-report, but I cannot understand why the code behaves this way. Because the arrays are large, ulimit -s unlimited might be required (on Linux) to avoid a segmentation fault.
$ ifort reuse.f90 && ./a.out
do-loop : 0.2072840
spread : 0.4781271
do-loop : 3.6670923E-02
$ gfortran reuse.f90 && ./a.out
do-loop : 0.232345015
spread : 0.342370987
do-loop : 4.52849865E-02
At least on Linux, the memory allocator uses an "optimistic memory allocation strategy" (or, for the Fortran side, see Why can Fortran allocate such large arrays?). It assumes that there will be enough memory, assigns the virtual address space, and that is all. The physical memory pages are only assigned when you first access the memory by assigning some values (or by trying to read the undefined garbage).
That has two implications:

1. If you requested too much memory, the allocate may still succeed and the program may crash later, when the pages are actually touched.
2. The first access to the memory will take more time.
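You can make the first-touch cost visible directly. The sketch below is mine, not part of your program (the program name, timing variables and the extra print are made up; the array shape is borrowed from the question); it times the allocate statement separately from the first full write to the array:

program first_touch
   implicit none
   integer :: n = 10000000
   real*8, allocatable, dimension(:, :) :: c
   real :: t0, t1, t2

   call cpu_time(t0)
   allocate(c(n, 3))      ! reserves virtual address space only
   call cpu_time(t1)
   c = 0                  ! first touch: physical pages are assigned here
   call cpu_time(t2)

   print*, 'allocate    : ', t1 - t0
   print*, 'first touch : ', t2 - t1
   print*, 'c(n, 3)     = ', c(n, 3)   ! use c so the initialization cannot be discarded
end program first_touch

On a typical Linux system the allocate line should report essentially zero time, while the first touch accounts for most of the cost.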
To remove the latter problem, initialize the memory first, e.g. c = 0, before the first timed section.
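Applied to the program in the question, that means touching c once outside any timed region, for example right after the calls to random_number. A sketch of the relevant part (only the added c = 0 line differs from the original):

   call random_number(a)
   call random_number(b)
   c = 0       ! touch every page of c once, before any timed region

   ! Case 1: Do loop
   call cpu_time(start)
   do i = 1, 3
      c(:, i) = b(:, i) - a(i)
   enddo
   call cpu_time(finish)
   print*, 'do-loop : ', finish-start

With that change, Case 1 and Case 3 should report comparable times, because neither of them pays for the page faults any more.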
There are other reasons why you should disregard the first run of any benchmark and always measure multiple times: not one long run, but several short runs. For example, the various turbo modes of modern CPUs may take some time to kick in.
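A rough sketch of what that can look like for the do-loop case (the repetition count of 5 is arbitrary, and the extra rep variable would need to be declared alongside i):

   integer :: rep

   do rep = 1, 5
      call cpu_time(start)
      do i = 1, 3
         c(:, i) = b(:, i) - a(i)
      enddo
      call cpu_time(finish)
      print*, 'do-loop, run ', rep, ': ', finish - start
   enddo

Looking at the spread of these timings, and typically discarding at least the first run, gives a much better picture than a single measurement.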