fortran scientific-computing quadruple-precision

CPU time in quadruple vs. double precision

I am doing some long-term simulations in which I'm trying to achieve the highest possible accuracy in the solution of systems of ODEs. I am trying to find out how much time quadruple (128-bit) precision calculations take versus double (64-bit) precision. I googled around a bit and saw several opinions about it: some say it will take 4 times longer, others 60-70 times... So I decided to take matters in my own hands and I wrote a simple Fortran benchmark program:

program QUAD_TEST

implicit none

integer,parameter  ::  dp = selected_int_kind(15)
integer,parameter  ::  qp = selected_int_kind(33)

integer   ::  cstart_dp,cend_dp,cstart_qp,cend_qp,crate
real      ::  time_dp,time_qp
real(dp)  ::  sum_dp,sqrt_dp,pi_dp,mone_dp,zero_dp
real(qp)  ::  sum_qp,sqrt_qp,pi_qp,mone_qp,zero_qp
integer   ::  i

! ==============================================================================

! == TEST 1. ELEMENTARY OPERATIONS ==
sum_dp = 1._dp
sum_qp = 1._qp
call SYSTEM_CLOCK(count_rate=crate)

write(*,*) 'Testing elementary operations...'

call SYSTEM_CLOCK(count=cstart_dp)
do i=1,50000000
  sum_dp = sum_dp - 1._dp
  sum_dp = sum_dp + 1._dp
  sum_dp = sum_dp*2._dp
  sum_dp = sum_dp/2._dp
end do
call SYSTEM_CLOCK(count=cend_dp)
time_dp = real(cend_dp - cstart_dp)/real(crate)
write(*,*) 'DP sum: ',sum_dp
write(*,*) 'DP time: ',time_dp,' seconds'

call SYSTEM_CLOCK(count=cstart_qp)
do i=1,50000000
  sum_qp = sum_qp - 1._qp
  sum_qp = sum_qp + 1._qp
  sum_qp = sum_qp*2._qp
  sum_qp = sum_qp/2._qp
end do
call SYSTEM_CLOCK(count=cend_qp)
time_qp = real(cend_qp - cstart_qp)/real(crate)
write(*,*) 'QP sum: ',sum_qp
write(*,*) 'QP time: ',time_qp,' seconds'
write(*,*)
write(*,*) 'DP is ',time_qp/time_dp,' times faster.'
write(*,*)

! == TEST 2. SQUARE ROOT ==
sqrt_dp = 2._dp
sqrt_qp = 2._qp

write(*,*) 'Testing square root ...'

call SYSTEM_CLOCK(count=cstart_dp)
do i = 1,10000000
   sqrt_dp = sqrt(sqrt_dp)
   sqrt_dp = 2._dp
end do
call SYSTEM_CLOCK(count=cend_dp)
time_dp = real(cend_dp - cstart_dp)/real(crate)
write(*,*) 'DP sqrt: ',sqrt_dp
write(*,*) 'DP time: ',time_dp,' seconds'

call SYSTEM_CLOCK(count=cstart_qp)
do i = 1,10000000
   sqrt_qp = sqrt(sqrt_qp)
   sqrt_qp = 2._qp
end do
call SYSTEM_CLOCK(count=cend_qp)
time_qp = real(cend_qp - cstart_qp)/real(crate)
write(*,*) 'QP sqrt: ',sqrt_qp
write(*,*) 'QP time: ',time_qp,' seconds'
write(*,*)
write(*,*) 'DP is ',time_qp/time_dp,' times faster.'
write(*,*)

! == TEST 3. TRIGONOMETRIC FUNCTIONS ==
pi_dp = acos(-1._dp); mone_dp = 1._dp; zero_dp = 0._dp
pi_qp = acos(-1._qp); mone_qp = 1._qp; zero_qp = 0._qp

write(*,*) 'Testing trigonometric functions ...'

call SYSTEM_CLOCK(count=cstart_dp)
do i = 1,10000000
    mone_dp = cos(pi_dp)
    zero_dp = sin(pi_dp)
end do
call SYSTEM_CLOCK(count=cend_dp)
time_dp = real(cend_dp - cstart_dp)/real(crate)
write(*,*) 'DP cos: ',mone_dp
write(*,*) 'DP sin: ',zero_dp
write(*,*) 'DP time: ',time_dp,' seconds'

call SYSTEM_CLOCK(count=cstart_qp)
do i = 1,10000000
    mone_qp = cos(pi_qp)
    zero_qp = sin(pi_qp)
end do
call SYSTEM_CLOCK(count=cend_qp)
time_qp = real(cend_qp - cstart_qp)/real(crate)
write(*,*) 'QP cos: ',mone_qp
write(*,*) 'QP sin: ',zero_qp
write(*,*) 'QP time: ',time_qp,' seconds'
write(*,*)
write(*,*) 'DP is ',time_qp/time_dp,' times faster.'
write(*,*)

end program QUAD_TEST

Results of a typical run, after compiling with gfortran 4.8.4, without any optimization flag:

 Testing elementary operations...
 DP sum:    1.0000000000000000     
 DP time:   0.572000027      seconds
 QP sum:    1.00000000000000000000000000000000000      
 QP time:    4.32299995      seconds

 DP is    7.55769205      times faster.

 Testing square root ...
 DP sqrt:    2.0000000000000000     
 DP time:    5.20000011E-02  seconds
 QP sqrt:    2.00000000000000000000000000000000000      
 QP time:    2.60700011      seconds

 DP is    50.1346169      times faster.

 Testing trigonometric functions ...
 DP cos:   -1.0000000000000000     
 DP sin:    1.2246467991473532E-016
 DP time:    2.79600000      seconds
 QP cos:   -1.00000000000000000000000000000000000      
 QP sin:    8.67181013012378102479704402604335225E-0035
 QP time:    5.90199995      seconds

 DP is    2.11087275      times faster.

Something must be going on here. My guess is that sqrt is computed with gfortran through an optimized algorithm, which probably hasn't been implemented for quadruple precision computations. This might not be the case for sin and cos, but why are elementary operations 7.6 times slower in quadruple precision while for trigonometric functions things slow down only by a factor of 2? If the algorithm used for the trigonometric functions would be the same for quad and double precision, I would expect to see a sevenfold increase in CPU time for them too.

What is the average slowdown in scientific calculations when 128-bit precision is used, compared to 64-bit?

I am running this on an Intel i7-4771 @ 3.50GHz.

Solution

More an extended comment than an answer, but...

Current CPUs provide a lot of hardware acceleration for double precision floating point arithmetic. Some even provide facilities for extended precision. Beyond that, you are limited to software implementations which are (as you noticed) considerably slower.

However, the exact factor of this slow-down is almost impossible to predict in generic cases. It depends on your CPU (e.g. which acceleration it has built-in), and on the software stack. For double-precision you typically use different math-libraries than for quadruple precision, and these might use different algorithms for basic operations.

For a particular operation/algorithm on a given hardware using the same algorithm you can probably derive a number, but this for sure will not be universally true.