Different optimizations in Math.Sum in Win32/64

I have the following code

const
  NumIterations = 10000000;
var
  i, j : Integer;
  x : array[1..100] of Double;
  Start : Cardinal;
  S : Double;
begin
  for i := Low(x) to High(x) do x[i] := i;

  Start := GetTickCount;
  for i := 1 to NumIterations do S := System.Math.Sum(x);
  ShowMessage('Math.Sum: ' + IntToStr(GetTickCount - Start));

  Start := GetTickCount;
  for i := 1 to NumIterations do begin
    S := 0;
    for j := Low(x) to High(x) do S := S + x[j];
  end;
  ShowMessage('Simple Sum: ' + IntToStr(GetTickCount - Start));
end;

When compiled for Win32 Math.Sum is considerably faster than the simple loop, as Math.Sum is written in Assembler and uses four-fold loop unrolling.

But when compiled for Win64, Math.Sum is considerably slower than the simple loop, because in 64-bit Math.Sum uses Kahan summation. This is an optimization for accuracy minimizing pile-up of errors during the summation process, but is considerably slower than even the simple loop.

I.e. when compiling for Win32 I get code optimized for speed, when compiling the same code for Win64 I get code optimized for accuracy. This is not exactly what I naively would expect.

Is there any sensible reason for this difference between Win32/64? Double is always 8 byte, so the accuracy should be identical in Win32/64.

Is Math.Sum still implemented identically (Assembler and loop unrolling in Win32, Kahan summation in Win64) in current versions of Delphi? I use Delphi-XE5.

Solution

Is Math.Sum still implemented identically (Assembler and loop unrolling in Win32, Kahan summation in Win64) in current versions of Delphi? I use Delphi-XE5.

Yes (Delphi 10.3.2).

Is there any sensible reason for this difference between Win32/64? Double is always 8 byte, so the accuracy should be identical in Win32/64.

32-bit Delphi for Win32 uses the old FPU, while the 64-bit compiler uses SSE instructions. When the 64-bit compiler was introduced in XE2, many of the old assembly routines was not ported to 64-bit. Instead, some routines were ported with similar functionality as other modern compilers.

You can enhance the 64-bit implementation a bit by introducing a Kahan summation function:

program TestKahanSum;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,Math,Diagnostics;

function KahanSum(const input : TArray<Double>): Double;
var
  sum,c,y,t : Double;
  i : Integer;         
begin
    sum := 0.0;                 
    c := 0.0;                      
    for i := Low(input) to High(input) do begin
      y := input[i] - c;  
      t := sum + y; 
      c := (t - sum) - y; 
      sum := t;                 
    end;
    Result := sum;
end;

var
  dArr : TArray<Double>;
  res : Double;
  i : Integer;
  sw : TStopWatch;
begin
  SetLength(dArr,100000000);
  for i := 0 to High(dArr) do dArr[i] := Pi;
  sw := TStopWatch.StartNew;
  res := Math.Sum(dArr);
  WriteLn('Math.Sum:',res,' [ms]:',sw.ElapsedMilliseconds);
  sw := TStopWatch.StartNew;
  res := KahanSum(dArr);
  WriteLn('KahanSum:',res,' [ms]:',sw.ElapsedMilliseconds);
  sw := TStopWatch.StartNew;
  res := 0;
  for i := 0 to High(dArr) do res := res + dArr[i];
  WriteLn('NaiveSum:',res,' [ms]:',sw.ElapsedMilliseconds);
  ReadLn;
end.

64-bit:

Math.Sum: 3.14159265358979E+0008 [ms]:492
KahanSum: 3.14159265358979E+0008 [ms]:359
NaiveSum: 3.14159265624272E+0008 [ms]:246

32-bit:

Math.Sum: 3.14159265358957E+0008 [ms]:67
KahanSum: 3.14159265358979E+0008 [ms]:958
NaiveSum: 3.14159265624272E+0008 [ms]:277

Pi with 15 digits is 3.14159265358979

The 32-bit math assembly routine is accurate to 13 digits in this example, while the 64-bit math routine is accurate to 15 digits.

Conclusion:

The 64 bit implementation is slower (by a factor of two compared to a naive summation), but more accurate than the 32-bit math routine.
Introducing an enhanced Kahan summation routine improves performance by 35%.