This is related to this question, but I believe the issue is identified more clearly with the example below.
I have some legacy code that looks like this:
subroutine ID_OG(N, DETERM)
   use variables, only: ID
   implicit real (A-H,O-Z)
   implicit integer (I-N)
   DETERM = 1.0
   DO 1 I = 1, N
1  ID(I) = 0
   DETERM = sum(ID)
end subroutine ID_OG
Replacing use variables, only: ID with real, dimension(N) :: ID or with real, dimension(:), allocatable :: ID causes a noticeable performance loss. Why is this? Is this expected behavior? I am wondering if it has something to do with the program needing to repeatedly allocate memory for the local array ID, while the use statement allows the program to skip the memory allocation step.
In the legacy code, ID is in module variables, but it is only used within the subroutine ID_OG. It is not used anywhere else in the code; it is not an input or an output. To me, it seems like good programming practice to remove ID from module variables and define it locally in the subroutine. But perhaps that isn't the case.
Minimal working example (MWE), compiled as gfortran -O3 test.f95 with gfortran 8.2.0:
MODULE variables
   implicit none
   real, dimension(:), allocatable :: ID
END MODULE variables
program test
   use variables
   implicit none

   integer :: N
   integer :: loop_max = 10**6   ! number of timed calls per subroutine
   integer :: ii                 ! loop index
   real :: DETERM
   real :: t1, t2
   real :: t_ID_OG, t_ID_header, t_ID_no_ID, t_OG_no_ID, t_allocate
   character(*), parameter :: format_header = '((A5, 1X), 20(A12,1X))'
   character(*), parameter :: format_data = '((I5, 1X), 20(ES12.5, 1X))'

   open(1, file = 'TimingSubroutines_ID.txt', status = 'unknown')
   write(1,format_header) 'N', 't_Legacy', 't_header', 't_head_No_ID', 't_Leg_no_ID', &
      & 't_allocate'

   do N = 1, 100
      allocate(ID(N))

      ! legacy version: ID comes from module variables
      call CPU_time(t1)
      do ii = 1, loop_max
         CALL ID_OG(N, DETERM)
      end do
      call CPU_time(t2)
      t_ID_OG = t2 - t1
      print*, N, DETERM

      ! module ID, but with implicit none and declared arguments
      call CPU_time(t1)
      do ii = 1, loop_max
         CALL ID_header(N, DETERM)
      end do
      call CPU_time(t2)
      t_ID_header = t2 - t1
      print*, N, DETERM

      ! local automatic array instead of the module array
      call CPU_time(t1)
      do ii = 1, loop_max
         CALL ID_header_no_ID(N, DETERM)
      end do
      call CPU_time(t2)
      t_ID_no_ID = t2 - t1
      print*, N, DETERM

      ! legacy-style implicit typing with a local automatic array
      call CPU_time(t1)
      do ii = 1, loop_max
         CALL ID_OG_no_ID(N, DETERM)
      end do
      call CPU_time(t2)
      t_OG_no_ID = t2 - t1
      print*, N, DETERM

      ! local allocatable array with an explicit allocate
      call CPU_time(t1)
      do ii = 1, loop_max
         CALL ID_OG_allocate(N, DETERM)
      end do
      call CPU_time(t2)
      t_allocate = t2 - t1
      print*, N, DETERM

      deallocate(ID)
      write(1,format_data) N, t_ID_OG, t_ID_header, t_ID_no_ID, t_OG_no_ID, t_allocate
   end do
end program test
subroutine ID_OG(N, DETERM)
   use variables, only: ID
   implicit real (A-H,O-Z)
   implicit integer (I-N)
   DETERM = 1.0
   DO 1 I = 1, N
1  ID(I) = 0
   DETERM = sum(ID)
end subroutine ID_OG

subroutine ID_header(N, DETERM)
   use variables, only: ID
   implicit none
   integer, intent(in) :: N
   real, intent(out) :: DETERM
   integer :: I
   DETERM = 1.0
   DO 1 I = 1, N
1  ID(I) = 0
   DETERM = sum(ID)
end subroutine ID_header

subroutine ID_header_no_ID(N, DETERM)
   implicit none
   integer, intent(in) :: N
   real, intent(out) :: DETERM
   integer :: I
   real, dimension(N) :: ID
   DETERM = 1.0
   DO 1 I = 1, N
1  ID(I) = 0
   DETERM = sum(ID)
end subroutine ID_header_no_ID

subroutine ID_OG_no_ID(N, DETERM)
   implicit real (A-H,O-Z)
   implicit integer (I-N)
   real, dimension(N) :: ID
   DETERM = 1.0
   DO 1 I = 1, N
1  ID(I) = 0
   DETERM = sum(ID)
end subroutine ID_OG_no_ID

subroutine ID_OG_allocate(N, DETERM)
   implicit real (A-H,O-Z)
   implicit integer (I-N)
   real, dimension(:), allocatable :: ID
   allocate(ID(N))
   DETERM = 1.0
   DO 1 I = 1, N
1  ID(I) = 0
   DETERM = sum(ID)
end subroutine ID_OG_allocate
Allocating the arrays takes time. The compiler is free to place local arrays wherever it wants, but the placement can typically be adjusted with compiler-specific flags. With gfortran, use -fstack-arrays to force local arrays onto the stack.
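For example, the MWE above can be rebuilt with the flag added to the original compile command:

gfortran -O3 -fstack-arrays test.f95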
Allocating on the stack just changes the stack pointer; it is virtually free. Allocating on the heap, however, is more involved and requires some bookkeeping.
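To make the distinction concrete, here is a minimal sketch (the subroutine names are mine, not from the question). The automatic array may be placed on the stack (with gfortran, -fstack-arrays forces this), while the explicitly allocated array always comes from the heap:

subroutine sum_automatic(N, total)
   ! automatic array: eligible for stack placement, so "allocation"
   ! is essentially a stack-pointer adjustment
   implicit none
   integer, intent(in) :: N
   real, intent(out) :: total
   real :: work(N)
   work = 0.0
   total = sum(work)
end subroutine sum_automatic

subroutine sum_allocatable(N, total)
   ! explicit allocate: the array always comes from the heap, so
   ! every call pays the allocator's bookkeeping cost
   implicit none
   integer, intent(in) :: N
   real, intent(out) :: total
   real, dimension(:), allocatable :: work
   allocate(work(N))
   work = 0.0
   total = sum(work)
   ! work is deallocated automatically when the subroutine returns
end subroutine sum_allocatable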
There are situations where local variables are appropriate and situations where global (module) variables are appropriate. One can also use local saved variables, or variables that are components of some object; see the sketch below. One cannot say which one is better without seeing the complete design of the code in question.
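As one illustration of the saved-variable option, here is a hypothetical sketch (not from the original post) of a saved allocatable that is reallocated only when it must grow, so repeated calls avoid the per-call allocation cost. Note that, like a module variable, a saved local is not thread-safe:

subroutine ID_saved(N, DETERM)
   ! hypothetical variant: the saved allocatable persists between
   ! calls and is reallocated only when it needs to grow
   implicit none
   integer, intent(in) :: N
   real, intent(out) :: DETERM
   real, dimension(:), allocatable, save :: ID
   integer :: I
   if (.not. allocated(ID)) then
      allocate(ID(N))
   else if (size(ID) < N) then
      deallocate(ID)
      allocate(ID(N))
   end if
   DETERM = 1.0
   do I = 1, N
      ID(I) = 0
   end do
   DETERM = sum(ID(1:N))   ! sum only the first N elements; ID may be larger
end subroutine ID_saved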
FWIW, with -fstack-arrays I do not see much difference, except when allocating explicitly using allocate():

[timing graph with -fstack-arrays]

Explicit allocate will always use the heap.

Without -fstack-arrays I do see some difference:

[timing graph without -fstack-arrays]
The graphs are quite noisy because my notebook is running many processes at the same time.
This is not to say that one should always use -fstack-arrays; I used it here only to demonstrate the difference. The option is useful, but care must be taken to avoid stack overflow errors. -fmax-stack-var-size=n may help with that.
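For instance, to cap stack-placed arrays at 64 KiB (an illustrative value; see the gfortran manual for the precise interaction of the two options):

gfortran -O3 -fstack-arrays -fmax-stack-var-size=65536 test.f95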