This is related to this question, but I believe the issue is identified more clearly with the example below.
I have some legacy code that looks like this:
subroutine ID_OG(N, DETERM)
   use variables, only: ID
   implicit real (A-H,O-Z)
   implicit integer (I-N)
   DETERM = 1.0
   DO 1 I = 1, N
1  ID(I) = 0
   DETERM = sum(ID)
end subroutine ID_OG
Replacing use variables, only: ID with real, dimension(N) :: ID or with real, dimension(:), allocatable :: ID causes a noticeable performance loss. Why is this? Is this expected behavior? I am wondering if it has something to do with the program needing to repeatedly allocate memory for the local array ID, while the use statement allows the program to skip the memory allocation step.
In the legacy code, ID is in module variables, but it is only used within the subroutine ID_OG. It is not used anywhere else in the code; it is not an input or an output. To me, it seems like good programming practice to remove ID from module variables and define it locally in the subroutine. But perhaps that isn't the case.
Minimal working example (MWE), compiled as gfortran -O3 test.f95 with gfortran 8.2.0:
MODULE variables
   implicit none
   real, dimension(:), allocatable :: ID
END MODULE variables
program test
   use variables
   implicit none

   integer :: N
   integer :: loop_max = 10**6   ! number of timed calls per subroutine
   integer :: ii                 ! loop index
   real :: DETERM
   real :: t1, t2
   real :: t_ID_OG, t_ID_header, t_ID_no_ID, t_OG_no_ID, t_allocate
   character(*), parameter :: format_header = '((A5, 1X), 20(A12,1X))'
   character(*), parameter :: format_data = '((I5, 1X), 20(ES12.5, 1X))'

   open(1, file = 'TimingSubroutines_ID.txt', status = 'unknown')
   write(1,format_header) 'N', 't_Legacy', 't_header', 't_head_No_ID', 't_Leg_no_ID', &
      & 't_allocate'

   do N = 1, 100
      allocate(ID(N))

      ! legacy version: ID comes from module variables
      call CPU_time(t1)
      do ii = 1, loop_max
         CALL ID_OG(N, DETERM)
      end do
      call CPU_time(t2)
      t_ID_OG = t2 - t1
      print*, N, DETERM

      ! module ID, but with implicit none and declared arguments
      call CPU_time(t1)
      do ii = 1, loop_max
         CALL ID_header(N, DETERM)
      end do
      call CPU_time(t2)
      t_ID_header = t2 - t1
      print*, N, DETERM

      ! local automatic array instead of the module array
      call CPU_time(t1)
      do ii = 1, loop_max
         CALL ID_header_no_ID(N, DETERM)
      end do
      call CPU_time(t2)
      t_ID_no_ID = t2 - t1
      print*, N, DETERM

      ! legacy-style implicit typing with a local automatic array
      call CPU_time(t1)
      do ii = 1, loop_max
         CALL ID_OG_no_ID(N, DETERM)
      end do
      call CPU_time(t2)
      t_OG_no_ID = t2 - t1
      print*, N, DETERM

      ! local allocatable array with an explicit allocate
      call CPU_time(t1)
      do ii = 1, loop_max
         CALL ID_OG_allocate(N, DETERM)
      end do
      call CPU_time(t2)
      t_allocate = t2 - t1
      print*, N, DETERM

      deallocate(ID)
      write(1,format_data) N, t_ID_OG, t_ID_header, t_ID_no_ID, t_OG_no_ID, t_allocate
   end do
end program test
subroutine ID_OG(N, DETERM)
   use variables, only: ID
   implicit real (A-H,O-Z)
   implicit integer (I-N)
   DETERM = 1.0
   DO 1 I = 1, N
1  ID(I) = 0
   DETERM = sum(ID)
end subroutine ID_OG

subroutine ID_header(N, DETERM)
   use variables, only: ID
   implicit none
   integer, intent(in) :: N
   real, intent(out) :: DETERM
   integer :: I
   DETERM = 1.0
   DO 1 I = 1, N
1  ID(I) = 0
   DETERM = sum(ID)
end subroutine ID_header

subroutine ID_header_no_ID(N, DETERM)
   implicit none
   integer, intent(in) :: N
   real, intent(out) :: DETERM
   integer :: I
   real, dimension(N) :: ID
   DETERM = 1.0
   DO 1 I = 1, N
1  ID(I) = 0
   DETERM = sum(ID)
end subroutine ID_header_no_ID

subroutine ID_OG_no_ID(N, DETERM)
   implicit real (A-H,O-Z)
   implicit integer (I-N)
   real, dimension(N) :: ID
   DETERM = 1.0
   DO 1 I = 1, N
1  ID(I) = 0
   DETERM = sum(ID)
end subroutine ID_OG_no_ID

subroutine ID_OG_allocate(N, DETERM)
   implicit real (A-H,O-Z)
   implicit integer (I-N)
   real, dimension(:), allocatable :: ID
   allocate(ID(N))
   DETERM = 1.0
   DO 1 I = 1, N
1  ID(I) = 0
   DETERM = sum(ID)
end subroutine ID_OG_allocate
Allocating the arrays takes time. The compiler is free to place local arrays wherever it wants, but the placement can typically be adjusted with compiler-specific flags. With gfortran, use -fstack-arrays to force local arrays onto the stack.
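For example, the MWE above can be rebuilt with the flag added to the original compile command:

gfortran -O3 -fstack-arrays test.f95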
Allocating on the stack just changes the stack pointer; it is virtually free. Allocating on the heap, however, is more involved and requires some bookkeeping.
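To make the distinction concrete, here is a minimal sketch (the subroutine names are mine, not from the question). The automatic array may be placed on the stack (with gfortran, -fstack-arrays forces this), while the explicitly allocated array always comes from the heap:

subroutine sum_automatic(N, total)
   ! automatic array: eligible for stack placement, so "allocation"
   ! is essentially a stack-pointer adjustment
   implicit none
   integer, intent(in) :: N
   real, intent(out) :: total
   real :: work(N)
   work = 0.0
   total = sum(work)
end subroutine sum_automatic

subroutine sum_allocatable(N, total)
   ! explicit allocate: the array always comes from the heap, so
   ! every call pays the allocator's bookkeeping cost
   implicit none
   integer, intent(in) :: N
   real, intent(out) :: total
   real, dimension(:), allocatable :: work
   allocate(work(N))
   work = 0.0
   total = sum(work)
   ! work is deallocated automatically when the subroutine returns
end subroutine sum_allocatable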
There are situations where local variables are appropriate and situations where global (module) variables are appropriate. One can also use local saved variables, or variables that are components of some object; see the sketch below. One cannot say which one is better without seeing the complete design of the code in question.
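As one illustration of the saved-variable option, here is a hypothetical sketch (not from the original post) of a saved allocatable that is reallocated only when it must grow, so repeated calls avoid the per-call allocation cost. Note that, like a module variable, a saved local is not thread-safe:

subroutine ID_saved(N, DETERM)
   ! hypothetical variant: the saved allocatable persists between
   ! calls and is reallocated only when it needs to grow
   implicit none
   integer, intent(in) :: N
   real, intent(out) :: DETERM
   real, dimension(:), allocatable, save :: ID
   integer :: I
   if (.not. allocated(ID)) then
      allocate(ID(N))
   else if (size(ID) < N) then
      deallocate(ID)
      allocate(ID(N))
   end if
   DETERM = 1.0
   do I = 1, N
      ID(I) = 0
   end do
   DETERM = sum(ID(1:N))   ! sum only the first N elements; ID may be larger
end subroutine ID_saved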
FWIW, with -fstack-arrays I do not see much difference, except when allocating explicitly using allocate():

[timing graph with -fstack-arrays]

Explicit allocate will always use the heap.

Without -fstack-arrays I do see some difference:

[timing graph without -fstack-arrays]
The graphs are quite noisy because my notebook is running many processes at the same time.
This is not to say that one should always use -fstack-arrays; I used it here only to demonstrate the difference. The option is useful, but care must be taken to avoid stack overflow errors. -fmax-stack-var-size=n may help with that.
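For instance, to cap stack-placed arrays at 64 KiB (an illustrative value; see the gfortran manual for the precise interaction of the two options):

gfortran -O3 -fstack-arrays -fmax-stack-var-size=65536 test.f95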