I am working on a large program that I have parallelized with OpenMP, and in several places the parallelization forces a choice between using pointers to arrays of data or copies of them. I had an intuitive sense that the choice might have an impact on compute time, but I am uncertain. In the example program given below, I am unable to discern a difference.
compile with:
gfortran paralleltest.f90 -fopenmp
run with:
>>> ./a.out 8 10000000
Copies (s): 0.835768998
results: 8415334.9722864032 8415334.9722864032 8415334.9722864032 8415334.9722864032 8415334.9722864032 8415334.9722864032 8415334.9722864032 8415334.9722864032
Pointers (s): 0.837329030
results: 8415334.9722864032 8415334.9722864032 8415334.9722864032 8415334.9722864032 8415334.9722864032 8415334.9722864032 8415334.9722864032 8415334.9722864032
Given this example, I am wondering whether the choice between pointers and copies actually matters for performance.
Here is the Fortran file:
! paralleltest.f90
module m_parallel_test
  implicit none

  type :: t_data
    real, allocatable :: x(:)
  end type t_data

  type :: t_data_ptr
    real, pointer :: x(:)
  end type t_data_ptr

contains

  real(kind=8) function aggregate_data(x) result(value)
    real :: x(:)
    integer :: i
    do i=1,size(x)
      value = value + cos(x(i))
    end do
  end function aggregate_data

  subroutine associate_pointer(parent, ptr)
    real, target :: parent(:)
    real, pointer :: ptr(:)
    ptr => parent
  end subroutine associate_pointer

end module m_parallel_test

program main
  use m_parallel_test
  implicit none
  character(len=32) :: arg
  integer, parameter :: n_set = 8
  integer :: n_threads, n_data, i
  type(t_data) :: parent
  type(t_data), allocatable :: copies(:)
  type(t_data_ptr), allocatable :: pointers(:)
  real(kind=8), allocatable :: copies_results(:)
  real(kind=8), allocatable :: pointers_results(:)
  real :: start_time, stop_time

  call get_command_argument(1, arg)
  read(arg, *) n_threads
  call omp_set_num_threads(n_threads)

  allocate(copies(n_set))
  allocate(pointers(n_set))
  allocate(copies_results(n_set))
  allocate(pointers_results(n_set))

  call get_command_argument(2, arg)
  read(arg, *) n_data
  allocate(parent%x(n_data))

  call seed_rng(1)
  call random_number(parent%x)

  do i=1,n_set
    copies(i)%x = parent%x
    call associate_pointer(parent%x, pointers(i)%x)
  end do

  call cpu_time(start_time)
  !$OMP PARALLEL DO
  do i=1,n_set
    copies_results(i) = aggregate_data(copies(i)%x)
  end do
  call cpu_time(stop_time)
  print *, "Copies (s): ", stop_time - start_time
  print *, "results: ", copies_results

  call cpu_time(start_time)
  !$OMP PARALLEL DO
  do i=1,n_set
    pointers_results(i) = aggregate_data(pointers(i)%x)
  end do
  call cpu_time(stop_time)
  print *, "Pointers (s): ", stop_time - start_time
  print *, "results: ", pointers_results
end program main

subroutine seed_rng(i)
  integer, intent(in) :: i
  integer, allocatable :: iseed(:)
  integer :: n, j
  call random_seed(size=n)
  allocate(iseed(n))
  iseed = [(i + j, j=1, n)]
  call random_seed(put=iseed)
end subroutine seed_rng
The code is not conforming for two reasons:

- The function result value in aggregate_data is not defined before being referenced.

- In the main program, the pointer component x of each element of pointers has undefined pointer association status when passed to the aggregate_data procedure, on account of the first actual argument in the calls to associate_pointer not having the TARGET attribute. (You can call a procedure with a dummy argument that has the TARGET attribute using a corresponding actual argument that does not, but when you do, pointers associated with the dummy argument (such as ptr inside associate_pointer) become undefined when execution of the procedure completes.)
The errors in the program can be corrected by giving the parent variable in the main program the TARGET attribute, and by initializing the function result (value = 0) prior to the loop in aggregate_data.
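In terms of the original source, a minimal sketch of those two corrections (names unchanged):

  real(kind=8) function aggregate_data(x) result(value)
    real :: x(:)
    integer :: i
    value = 0                 ! define the result before accumulating into it
    do i=1,size(x)
      value = value + cos(x(i))
    end do
  end function aggregate_data

and, in the main program's declarations,

  type(t_data), target :: parent   ! TARGET, so pointers associated with parent%x
                                    ! remain defined after associate_pointer returns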
Beyond correctness, the only difference between the two loops is in the actual argument passed to aggregate_data - one loop passes an array designated by an allocatable component, the other passes an array designated by a pointer component. The procedure takes that argument as an assumed-shape array dummy argument without CONTIGUOUS - given the typical implementation of assumed-shape arrays, there is no need for the compiler to do anything different when passing a pointer versus an allocatable.
(If the dummy argument of aggregate_data were explicitly or implicitly contiguous, there might be a need for the compiler to copy the pointer array prior to the call, as pointers can be associated with non-contiguous arrays. Allocatable arrays are always contiguous.)
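To illustrate that contiguity point, here is a hypothetical variant (module, program and procedure names invented for the example) where the dummy argument has the CONTIGUOUS attribute; a non-contiguous pointer actual argument may then be copied to a temporary before the call, while an allocatable actual never needs to be:

  module m_contig_demo
    implicit none
  contains
    real(kind=8) function sum_cos(x) result(value)
      real, contiguous :: x(:)      ! CONTIGUOUS added relative to aggregate_data
      integer :: i
      value = 0
      do i=1,size(x)
        value = value + cos(x(i))
      end do
    end function sum_cos
  end module m_contig_demo

  program contig_demo
    use m_contig_demo
    implicit none
    real, target :: a(10)
    real, pointer :: every_other(:)
    call random_number(a)
    every_other => a(1:10:2)         ! non-contiguous pointer association
    print *, sum_cos(a)              ! contiguous actual: passed directly
    print *, sum_cos(every_other)    ! may be copied to a contiguous temporary
  end program contig_demo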
Both loops write their result to a non-target allocatable array of a different kind from the actual argument components - so there cannot be any aliasing across the assignment statement.
Neither actual argument is modified within the loops, so potential aliasing issues associated with the pointer arguments designating the same array in different iterations are moot (besides which, from the compiler's perspective that guarantee is on the programmer's head - it doesn't have to care).
All up, I'd expect the same machine instructions to be issued for the two loops. Similar timing is unsurprising.
Results may differ if different work has to be done in the loops - if aliasing or contiguity comes into play.
Where the approaches will differ is in the data copying required to set up the allocatable array component arguments. But that is (intentionally?) outside the timed region.
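If you wanted that setup cost in the comparison, you could time just the copying part of the existing setup loop the same way (a sketch against the original main program, using its variables):

  call cpu_time(start_time)
  do i=1,n_set
    copies(i)%x = parent%x           ! the copying is where the approaches differ
  end do
  call cpu_time(stop_time)
  print *, "Copy setup (s): ", stop_time - start_time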
From Fortran 2003 on (which this code requires), do not use POINTER unless you need reference semantics.
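As a small sketch of that distinction, reusing the derived types from the question (the program name is invented): intrinsic assignment of a t_data copies the allocatable component (value semantics), while intrinsic assignment of a t_data_ptr only copies the pointer association (reference semantics).

  program semantics_demo
    use m_parallel_test
    implicit none
    type(t_data) :: a, b
    type(t_data_ptr) :: p, q
    a%x = [1.0, 2.0]
    b = a                  ! deep copy of the allocatable component
    b%x(1) = 99.0
    print *, a%x(1)        ! still 1.0
    allocate(p%x(2))
    p%x = [1.0, 2.0]
    q = p                  ! q%x becomes associated with the same array as p%x
    q%x(1) = 99.0
    print *, p%x(1)        ! now 99.0
  end program semantics_demo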