arraysfortrangfortranintel-fortran

Sum and assign of array is slower in derived types


I was comparing the performance of doing a sum followed by an assignment of two arrays, in the form of c=a+b, between a native Fortran type, real, and a derived data type that only contains one array of real. The class is very simple: it contains operators for addition and assignment and a destructor, as follows:

module type_mod

use iso_fortran_env

type :: class_t
  real(8), dimension(:,:), allocatable :: a
contains
  procedure :: assign_type
  generic, public :: assignment(=) => assign_type
  procedure :: sum_type
  generic :: operator(+) => sum_type
  final :: destroy
end type class_t

contains

  subroutine assign_type(lhs, rhs)
    class(class_t), intent(inout) :: lhs
    type(class_t), intent(in) :: rhs
    lhs % a = rhs % a
  end subroutine assign_type

  subroutine destroy(this)
    type(class_t), intent(inout) :: this
    if (allocated(this % a)) deallocate(this % a)
  end subroutine destroy

  function sum_type (lhs, rhs) result(res)
    class(class_t), intent(in) :: lhs
    type(class_t), intent(in) :: rhs
    type(class_t) :: res
    res % a = lhs % a + rhs % a
  end function sum_type

end module type_mod

The assign subroutine contains different modes of operations, just for the sake of benchmarking.

To test it against performing the same operations on a real I created the following module

module subroutine_mod

  use type_mod, only: class_t

  contains

  subroutine sum_real(a, b, c)
    real(8), dimension(:,:), intent(inout) :: a, b, c
    c = a + b
  end subroutine sum_real


  subroutine sum_type(a, b, c)
    type(class_t), intent(inout) :: a, b, c
    c = a + b
  end subroutine sum_type

end module subroutine_mod

Everything is executed in the program below, considering arrays of size (10000,10000) and repeating the operation 100 times:

program test

  use subroutine_mod

  integer :: i
  integer :: N = 100 ! Number of times to repeat the assign
  integer :: M = 10000 ! Size of the arrays
  real(8) :: tf, ts
  real(8), dimension(:,:), allocatable :: a, b, c
  type(class_t) :: a2, b2, c2

  allocate(a2%a(M,M), b2%a(M,M), c2%a(M,M))
  a2%a = 1.0d0
  b2%a = 2.0d0
  c2%a = 3.0d0

  allocate(a(M,M), b(M,M), c(M,M))
  a = 1.0d0
  b = 2.0d0
  c = 3.0d0

  ! Benchmark timing with
  call cpu_time(ts)
  do i = 1, N
    call sum_type(a2, b2, c2)
  end do
  call cpu_time(tf)
  write(*,*) "Type : ", tf-ts

  call cpu_time(ts)
  do i = 1, N
    call sum_real(a, b, c)
  end do
  call cpu_time(tf)
  write(*,*) "Real : ", tf-ts

end program test

To my surprise, the operation with my derived datatype consistently underperformed the operation with the Fortran arrays by a factor of 2 with gfortran and a factor of 10 with ifort. For instance, using the CHECK_SIZE mode, which saves allocation time, I got the following timings compiling with the -O2 flag:

gfortran

  • Data type: 33 s
  • Real : 13 s

ifort

  • Data type: 30 s
  • Real : 3 s

Question

Is this normal behaviour? If so, are there any recommendations to achieve better performance?

Context

To provide some context, the type with a single array will be very useful for a code refactoring task, where we need to keep similar interfaces to a previous type.

Compiler versions

  • gfortran 9.4.0
  • ifort 2021.6.0 20220226

Solution

  • You are worried about allocation time, but you do a lot of allocations of arrays of shape [M,M] for the derived type, and almost none for the intrinsic type.

    The only allocations for the intrinsic type are in the main program, for a, b and c. These are outside the timing loop.

    For the derived type, you allocate for a2%a, b2%a and c2%a (again outside the timing loop), but also res%a in the function sum, N times inside the timing loop.

    Equally, inside the sum_real subroutine the assignment statement c=a+b involves no allocatable object but inside sum_type the c in c=a+b is an allocatable array: the compiler checks whether c is allocated and if so, whether its shape matches the right-hand side expression.

    In summary: you are not comparing like with like. There's a lot of overhead in wrapping an intrinsic array as an allocatable component of a derived type.


    Tangential to your timing concerns is the "cleverness" of the subroutine assign. It's horrible.

    Calling an argument lhs when it's associated with the right-hand side of the assignment statement is a little confusing, but the select case construct is confusing beyond a little.

    In

    case (ASSUMED_SIZE)
      this % a = lhs % a
    

    under rules where the rest of the program makes any sense, invokes a couple of checks:

    • is this%a allocated? If not, allocate it to the shape of lhs%a.
    • if it is allocated, check whether the shape matches lhs%a, if not deallocate it then allocate it to the shape of lhs%a.

    Those checks and actions which are done manually in the CHECK_SIZE case, in other words.

    The final subroutine does nothing of value, so the entire assign subroutine's execution can be replaced by this%a = lhs%a.

    (Things would be different if the final subroutine had substantive effect or the compiler had been asked to ignore the rules of intrinsic assignment using -fno-realloc-arrays or -nostandard-realloc-lhs for example, or this%a(:,:)=lhs%a had been used.)