fortran gfortran intel-fortran

Compiler optimization when variables are reused


While benchmarking the subtraction of a vector from a matrix, I noticed that Fortran compilers appear to perform some kind of optimization when I reuse variables or code. It looks as if the arrays are being served from cache on reuse, but I am not sure. I believe this is skewing my benchmark results, so I would like to identify the specific optimization and, if possible, turn it off.

For example, the following code compares two cases and then adds a Case 3 that is identical to Case 1. However, the time reported for Case 3 is much lower than for Case 1.

program main
  implicit none

  integer :: n = 1E7
  real*8, dimension(3) :: a
  real*8, allocatable, dimension(:, :) :: b, c
  real :: start, finish
  integer :: i

  allocate(b(n, 3))
  allocate(c(n, 3))

  call random_number(a)
  call random_number(b)

  ! Case 1: Do loop
  call cpu_time(start)
  do i = 1, 3
    c(:, i) = b(:, i) - a(i)
  enddo
  call cpu_time(finish)
  print*, 'do-loop : ', finish-start

  ! Case 2: Spread
  call cpu_time(start)
  c = b - spread(a, dim=1, ncopies=n)
  call cpu_time(finish)
  print*, 'spread  : ', finish-start

  ! Case 3: Do loop (again)
  call cpu_time(start)
  do i = 1, 3
    c(:, i) = b(:, i) - a(i)
  enddo
  call cpu_time(finish)
  print*, 'do-loop : ', finish-start

end program main

The Intel and GNU compilers produce similar results, as shown below. I have tried investigating with flags such as -O0 and -qopt-report, but I cannot work out why the code behaves this way. Because the arrays are large, ulimit -s unlimited might be required (on Linux) to avoid a segmentation fault.

$ ifort reuse.f90 && ./a.out 
 do-loop :   0.2072840    
 spread  :   0.4781271    
 do-loop :   3.6670923E-02

$ gfortran reuse.f90 && ./a.out
 do-loop :   0.232345015    
 spread  :   0.342370987    
 do-loop :    4.52849865E-02


Solution

  • At least on Linux, the memory allocator uses the "optimistic memory allocation strategy" (for Fortran, see also "Why can Fortran allocate such large arrays?"). It assumes that there will be enough memory and merely reserves the virtual address space. Physical memory pages are assigned only when you first access the memory, either by assigning values or by reading the undefined garbage.

    This has two implications.

    1. If you requested too much memory, the allocate may still succeed and the program may crash later.

    2. The first access will take more time.

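    To remove the second problem, initialize the memory before starting the timer, e.g. c = 0. In the code above, b is already touched by random_number, but c is first written inside the timed Case 1 loop, so Case 1 also pays for the page assignment.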

    There are other reasons to disregard the first run of any test and to always repeat the measurement: not one long test, but multiple short runs. For example, modern CPUs have various turbo modes that may take some time to kick in.
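The two suggestions above (touch the memory before timing, and repeat the measurement) can be sketched as follows. This is only an illustration under some assumptions: the program name warmup_benchmark and the repetition count nrep are made up for this example, and system_clock is used instead of the question's cpu_time to measure wall-clock time. Like the original, it allocates roughly 480 MB in total.

```fortran
program warmup_benchmark
  use, intrinsic :: iso_fortran_env, only: real64, int64
  implicit none

  integer, parameter :: n = 10000000   ! same array length as in the question
  integer, parameter :: nrep = 5       ! hypothetical number of repetitions
  real(real64) :: a(3)
  real(real64), allocatable :: b(:, :), c(:, :)
  integer(int64) :: t0, t1, rate
  integer :: i, rep

  allocate(b(n, 3), c(n, 3))
  call random_number(a)
  call random_number(b)                ! this already touches every page of b

  ! Touch every page of c before any timing starts, so that the first
  ! timed loop does not also pay for the OS assigning physical pages.
  c = 0.0_real64

  call system_clock(count_rate=rate)

  ! Repeat the same measurement several times instead of trusting one run.
  do rep = 1, nrep
    call system_clock(t0)
    do i = 1, 3
      c(:, i) = b(:, i) - a(i)
    end do
    call system_clock(t1)
    print '(a, i0, a, f10.6, a)', 'do-loop run ', rep, ': ', &
      real(t1 - t0, real64) / real(rate, real64), ' s'
  end do

  ! Use the result so the compiler cannot discard the timed loops.
  print *, 'checksum: ', sum(c)
end program warmup_benchmark
```

With c initialized up front, the first timed run should be much closer to the later ones than Case 1 was to Case 3 in the question. Some variation between runs may remain, which is why several repetitions are printed rather than a single number.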