Search code examples
delphidelphi-xe5

Different optimizations in Math.Sum in Win32/64


I have the following code

const
  NumIterations = 10000000;
var
  i, j : Integer;
  x : array[1..100] of Double;
  Start : Cardinal;
  S : Double;
begin
  for i := Low(x) to High(x) do x[i] := i;

  Start := GetTickCount;
  for i := 1 to NumIterations do S := System.Math.Sum(x);
  ShowMessage('Math.Sum: ' + IntToStr(GetTickCount - Start));

  Start := GetTickCount;
  for i := 1 to NumIterations do begin
    S := 0;
    for j := Low(x) to High(x) do S := S + x[j];
  end;
  ShowMessage('Simple Sum: ' + IntToStr(GetTickCount - Start));
end;

When compiled for Win32 Math.Sum is considerably faster than the simple loop, as Math.Sum is written in Assembler and uses four-fold loop unrolling.

But when compiled for Win64, Math.Sum is considerably slower than the simple loop, because in 64-bit Math.Sum uses Kahan summation. This is an optimization for accuracy minimizing pile-up of errors during the summation process, but is considerably slower than even the simple loop.

I.e. when compiling for Win32 I get code optimized for speed, when compiling the same code for Win64 I get code optimized for accuracy. This is not exactly what I naively would expect.

Is there any sensible reason for this difference between Win32/64? Double is always 8 byte, so the accuracy should be identical in Win32/64.

Is Math.Sum still implemented identically (Assembler and loop unrolling in Win32, Kahan summation in Win64) in current versions of Delphi? I use Delphi-XE5.


Solution

  • Is Math.Sum still implemented identically (Assembler and loop unrolling in Win32, Kahan summation in Win64) in current versions of Delphi? I use Delphi-XE5.

    Yes (Delphi 10.3.2).

    Is there any sensible reason for this difference between Win32/64? Double is always 8 byte, so the accuracy should be identical in Win32/64.

    32-bit Delphi for Win32 uses the old FPU, while the 64-bit compiler uses SSE instructions. When the 64-bit compiler was introduced in XE2, many of the old assembly routines was not ported to 64-bit. Instead, some routines were ported with similar functionality as other modern compilers.


    You can enhance the 64-bit implementation a bit by introducing a Kahan summation function:

    program TestKahanSum;
    
    {$APPTYPE CONSOLE}
    
    uses
      System.SysUtils,Math,Diagnostics;
    
    function KahanSum(const input : TArray<Double>): Double;
    var
      sum,c,y,t : Double;
      i : Integer;         
    begin
        sum := 0.0;                 
        c := 0.0;                      
        for i := Low(input) to High(input) do begin
          y := input[i] - c;  
          t := sum + y; 
          c := (t - sum) - y; 
          sum := t;                 
        end;
        Result := sum;
    end;
    
    var
      dArr : TArray<Double>;
      res : Double;
      i : Integer;
      sw : TStopWatch;
    begin
      SetLength(dArr,100000000);
      for i := 0 to High(dArr) do dArr[i] := Pi;
      sw := TStopWatch.StartNew;
      res := Math.Sum(dArr);
      WriteLn('Math.Sum:',res,' [ms]:',sw.ElapsedMilliseconds);
      sw := TStopWatch.StartNew;
      res := KahanSum(dArr);
      WriteLn('KahanSum:',res,' [ms]:',sw.ElapsedMilliseconds);
      sw := TStopWatch.StartNew;
      res := 0;
      for i := 0 to High(dArr) do res := res + dArr[i];
      WriteLn('NaiveSum:',res,' [ms]:',sw.ElapsedMilliseconds);
      ReadLn;
    end.
    

    64-bit:

    Math.Sum: 3.14159265358979E+0008 [ms]:492
    KahanSum: 3.14159265358979E+0008 [ms]:359
    NaiveSum: 3.14159265624272E+0008 [ms]:246
    

    32-bit:

    Math.Sum: 3.14159265358957E+0008 [ms]:67
    KahanSum: 3.14159265358979E+0008 [ms]:958
    NaiveSum: 3.14159265624272E+0008 [ms]:277
    

    Pi with 15 digits is 3.14159265358979

    The 32-bit math assembly routine is accurate to 13 digits in this example, while the 64-bit math routine is accurate to 15 digits.


    Conclusion:

    • The 64 bit implementation is slower (by a factor of two compared to a naive summation), but more accurate than the 32-bit math routine.

    • Introducing an enhanced Kahan summation routine improves performance by 35%.