
MSVC FP library float erfcf(x) unexpectedly slow - 2x slower than double erfc(x). Why?


Can someone please verify that the slowness I see with the MS library implementation of erfcf(x) is reproducible elsewhere? It might just be a benchmarking quirk. Also, any suggestions for command-line options to make gcc-compiled code run faster would be welcome.

I have adopted @Njuffa's test harness and erfcf(x) function for my experiments on better approximations of the functions that I am interested in. Because the benchmarks take a while and I like to have a progress bar, I added code to print a "." every so often. The dot printing stalls at tricky value ranges when the range of possible values is swept linearly; I'm using that to guide optimisations.

It has the advantage of system library support for float erfcf(x) and an accurate double erfc(x) as a reference to test against. In the process of testing I have noticed that the Intel 2024.1 ICX library is both fast and accurate, but there are some oddities with the Microsoft MSVC 17.1 implementation and inexplicable (to me) slowness in the dummy loop test when it is compiled with gcc 13.1. I suspect my lack of familiarity with gcc and/or systematic errors arising from running it in a virtual machine may be to blame. I'd be grateful for timings on a native Linux system for comparison. Compiler options in each case are -O3 (or equivalent), inline everything for maximum speed, and FP mode precise.

This is the minimal reproducible example.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <math.h>
#include <stdbool.h>  // for bool if compiled as C (harmless in C++)

//#define RANDOM  (1)

// helper routines

float uint32_as_float(uint32_t a)
{
   float r;
   memcpy(&r, &a, sizeof r);
   return r;
}

uint32_t float_as_uint32(float a)
{
   uint32_t r;
   memcpy(&r, &a, sizeof r);
   return r;
}

bool my_isnan(float x)
{
  //    used here so that -ffast-math can't optimise it away (as it does with the system isnan())
  //    other options can be used here
  //    return __isnan((double)x);
  //    return __isnanf(x);
  uint32_t ix = float_as_uint32(x);
  //    return !((~ix) & 0x7f800000);  // Intel's choice is slightly faster
  return (ix & 0x7f800000) == 0x7f800000;
}

float erfcdf(float x)
{
  return (float)erfc((double)x);  // shim for calling double precision erfc
}

float dummy(float x)  
{
  return x;  // to determine loop overheads
}

void timefun(const char* name, float (*test_fun)(float), bool verbose)
{
  uint32_t argi, largi = 0;
  float arg, res, sum;
  clock_t start0, start, end;  // clock() returns clock_t, not time_t
  printf("\nTiming %s\n", name);
  argi = 0;
  sum = 0.0;
  start0 = start = clock();
  do {
      arg = uint32_as_float(argi);
      res = (*test_fun)(arg);
      if (!my_isnan(res)) sum += res;
#ifdef RANDOM
      argi = (argi * 1664525 + 1013904223); // ranqd1 
#else
      argi++;
#endif
      if (verbose && ((argi & 0xff800000) != largi))
      {
          end = clock();
          largi = argi & 0xff800000;
          printf("Exp %x : %6.3f\n", argi >> 20, (float)(end - start) / CLOCKS_PER_SEC);
          start = clock();
      }
      if ((argi & 0x3ffffff) == 0) { printf("."); fflush(stdout); }  // flush so progress dots appear promptly
  } while (argi);
  end = clock();
  printf("\ntime taken %6.2f  sum = %g\n", (float)(end - start0) / CLOCKS_PER_SEC, sum);
}

int main(void)
{
  timefun("dummy", dummy, false);
  timefun("erfcf", erfcf, false);
  timefun("erfcdf", erfcdf, false);
  return 0;
}

It should compile as is on any of gcc, Intel or MS compilers. I'd be interested in benchmarks on other compilers too.

These are the figures (all times in seconds). Linear tests every possible bit pattern of x (including NaNs) in sequence from 0 to 0xffffffff. Random uses ranqd1 to break any branch prediction while still executing every possible value of x: the LCG has full period 2^32, so it visits every 32-bit pattern exactly once.

Compiler       Linear dummy   Linear erfcf   Linear erfc   Rand dummy   Rand erfcf   Rand erfc
gcc 13.1       31             31.6           51.1          38.7         192          169
Intel 2024.1    2.0           37.6           45.0           3.9          46.9         54.3
MS 17.1         2.0           96.2           47.6           4.0         275          125

I'm unhappy with the behaviour of GCC on the dummy(x) routine, which simply returns x: it is an order of magnitude slower than with the other compilers, and it also seems far too slow when fed pseudo-random data. I presume that my lack of familiarity with that compiler has led to it being a lot slower than it should be. The summation for dummy will overflow, so I suspect that I haven't got the compiler options right to handle that situation efficiently.

The command line I'm using to build it on Linux is:

gcc -O3 -march=native -mavx2 -finline-functions -Winline erfc_njaffa6.cpp -lm
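
For comparison, the MSVC and ICX command lines would be roughly as follows (an assumption on my part; these are the precise-FP, maximum-optimisation equivalents of the gcc line above and may differ from the flags actually used):

cl /O2 /fp:precise /arch:AVX2 erfc_njaffa6.cpp
icx -O3 -march=native -fp-model=precise erfc_njaffa6.cpp -lm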

I am fairly convinced that the MS slowness is due to naive processing of denormals in the polynomials, which can be very slow. Break into the debugger when the dots stall and you will see denormal values of x. MS is the only one where float erfcf is 2x slower than double erfc; with GCC and Intel the linear float versions are both marginally (~15%) faster than the double ones, as you might expect. GCC struggles and is slow on the pseudo-random test and I don't understand why. Any suggestions for how to make it run faster?
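
One quick way to probe the denormal hypothesis (my own suggestion, not part of the benchmark above) is to enable flush-to-zero and denormals-are-zero in the x86 SSE control register before calling timefun(): if the erfcf stalls disappear, denormal handling is the culprit. Note that FTZ/DAZ changes results for tiny inputs, so this is a diagnostic only:

#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

void enable_ftz_daz(void)
{
  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // denormal results flushed to 0
  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // denormal inputs treated as 0
}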

There is also the question of accuracy. My benchmark for that shows that all of the double erfc(x) implementations give 0.5 ULP when rounded to float (not surprisingly), but only the Intel library implementation of float erfcf(x) achieves sub-ULP accuracy (0.82 ULP worst case). GCC comes second with 2.8 ULP and MS trails in third with 5.7 ULP.

So the MS float erfcf implementation is not only twice as slow as their double implementation but also 10x less accurate. Even with the shims converting to and from double, MS erfc(x) is 2x faster than erfcf(x). Most odd...
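
For reference, the core of the ULP measurement can be sketched like this (my own reconstruction, reusing float_as_uint32() from the harness above; fractional figures such as 0.82 ULP additionally need the unrounded double reference, so this integer version is only the basic building block). Mapping the sign-magnitude float bit patterns onto a monotonic integer scale makes the difference count the representable floats between the two values:

#include <stdlib.h>  // llabs

int64_t float_to_ordered(float f)
{
  uint32_t u = float_as_uint32(f);
  // negative floats: sign-magnitude -> negative magnitude on a linear scale
  return (u & 0x80000000u) ? -(int64_t)(u & 0x7fffffffu) : (int64_t)u;
}

uint32_t ulp_distance(float approx, double ref)
{
  float rounded = (float)ref;  // round the reference to the nearest float
  return (uint32_t)llabs(float_to_ordered(approx) - float_to_ordered(rounded));
}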


Solution

  • Thanks to the suggestions made above, I upgraded my benchmark so that verbose mode now shows the execution time for each possible value of the FP exponent. The results are interesting and highlight problems with both the MS and Intel erfcf library implementations.

    The time signature of the really very slow (and inaccurate) MS erfcf(x) implementation is below: [Figure: MS erfcf(x) timing]

    Each operation on a denormal adds a hit of 0.3 s to the execution time; in the worst case it reaches 1.6 s over the exponent range -63 to -74, tailing off on either side. You can also see the effects of underflow arising in the region of positive x > 10 (orange line).

    Zooming in to where denormals occur and propagate through the polynomial, we see how bad it gets: [Figure: denorm x^2 region details]. This is clearly rather pointless, since erfcf(x) == 1 for any |x| < 3.0e-8 (erfc(x) = 1 - 2x/sqrt(pi) + O(x^3), so for such tiny x the correction is below half an ULP of 1.0f and the result rounds to exactly 1). It also handles raw denormal inputs very slowly.

    The Intel time signature is better, but it too has a problem for x > 0: an underflow is still generated when x > 10.2, which makes all the exponents from +4 to +128 take 0.2 s. [Figure: Intel erfcf(x) timing]

    Armed with these data and my new improved ARM Remez algorithm code, I set about making a go-faster full-precision float erfcf(x) function, and this is the result. You can test it in @njuffa's test harness linked in the question. It is just sub-ULP, at 0.946 ULP relative error.

    float new_erfcdf(float x)  // computation entirely in double precision to rule out rounding glitches
    {
      double s, r, c;
      c = fabs(x); 
      if (c < 4e-8) return 1.0f;
      s = c * c;
      if (c > 10.2) { if (x > 0.0f) return 0.0f; else return 2.0f; } // on MSC this is *much* faster 
      if (c < 0.08) return (float)(1.0 + x * (-1.1283776919236866465056 + s * (0.3752183453419132113961)));
      if (my_isnan(x)) return x + x;
      if (c < 2.1)
      {
        return (float)(1.0 + x * (-1.12837903762067845122478 +
                  s * (0.376124824775496900 +
                  s * (-0.11283152344661124283 +
                  s * (0.02685306318990355076153 +
                  s * (-0.005208108400366166889408 +
                  s * (0.00084248506000537495895 +
                  s * (-0.000114081296888176590432631682 +
                  s * (1.25819680304023068029935e-05 +
                  s * (-1.053171458705915517055e-06 +
                  s * (5.83737961310388214431176e-08 +
                  s * (-1.57325913775639606424677e-09))))))))))));
      }
      // else 4,4 rational after Njuffa    
      r = (1.031356822232436016816 + c * (0.943286562098123339 +
           c * (0.453315560300251674 + c * (-3.27806565406628617583e-06)))) /
          (1.0 + c * (2.19836964842143340 + c * (1.675510512431580478 +
           c * (0.8032578099371390774))));
      r = r * exp(-s);
      if (x < 0) r = 2.0 - r;
      return (float)r;
    }
    
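    To benchmark it (usage sketch, names as in the question's harness), drop the function into the test harness from the question and add one line to main():

    int main(void)
    {
      timefun("dummy", dummy, false);
      timefun("erfcf", erfcf, false);
      timefun("erfcdf", erfcdf, false);
      timefun("new_erfcdf", new_erfcdf, false);  // the new implementation above
      return 0;
    }
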

    I tried to make a pure float version, but the tricky cancellation of the first three terms in x when x ~ 2 seems to make that impossible, so double precision arithmetic has been used to tame it. The code takes 7.9 s to complete the sequential benchmark and 26 s in random order, which is substantially faster than all the system libraries I have tested. Here is its timing signature: [Figure: novel erfcf(x) timing (note the change of vertical scale)]

    It is possible that it could be made a bit faster or more accurate, but this is essentially a proof-of-concept implementation to share. The choice of breakpoints is determined by the need to keep the error budget under control; they were tuned by hand and could be tweaked to make small improvements to accuracy. It is a very worthwhile improvement in speed.

    A pure float version would potentially be slightly quicker, but it is hopelessly inaccurate for 3 < s < 4. Any thoughts on how to fix that? Someone with better annealing code might be able to bring the coefficients under control, but I suspect there is too much catastrophic cancellation going on in that region to be safe in floats. I will raise the issue of a pure float implementation in another thread later.
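
    A rough way to see the scale of that cancellation (my own illustration, not the author's analysis): near x = 2 the polynomial branch must deliver erfc(2) ~ 4.68e-3, yet its outermost step computes 1.0 + x*(...) where x*(...) ~ -0.9953, so about -log2(0.00468) ~ 7.7 leading bits cancel, leaving only ~16 good significand bits in a pure float evaluation:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
      double x = 2.0;
      double p = erfc(x);           // ~4.6777e-3: what the polynomial must produce
      double tail = (p - 1.0) / x;  // the x*(...) part of 1.0 + x*(...)
      printf("erfc(2)   = %.10g\n", p);
      printf("x * tail  = %.10g\n", x * tail);  // ~ -0.9953: cancels against the leading 1.0
      printf("bits lost = %.2f\n", -log2(p));   // ~7.7 bits lost to cancellation
      return 0;
    }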