Tags: floating-point, language-agnostic, ieee-754

How to compute the mean of two floating-point numbers?


What is the most accurate way to compute the mean of two numbers in floating-point arithmetic? Let's consider the most common case: double-precision 64-bit numbers.

  1. (a + b) / 2

  2. a / 2 + b / 2

  3. a + (b - a) / 2

These ways of computing the mean can give different results as can be seen in the C++ code below:

#include <iostream>
#include <iomanip>

using namespace std;

int main() {
    double a = 1.2;
    double b = 3.6;
    double mean1 = (a + b) / 2.0;
    double mean2 = a / 2.0 + b / 2.0;
    double mean3 = a + (b - a) / 2.0;
    cout << fixed << setprecision(20);
    cout << "mean1: " << mean1 << endl;
    cout << "mean2: " << mean2 << endl;
    cout << "mean3: " << mean3 << endl;
}

which produces the result

mean1: 2.39999999999999991118
mean2: 2.39999999999999991118
mean3: 2.40000000000000035527

The question Mathematical properties of the computed average value of two floating point numbers? compares cases 1 and 2. The brief conclusion is that the accuracy is the same, but case 1 can overflow while case 2 can underflow.

mean1 and mean2 seem to provide a more accurate result, as they are closer to the true value 2.4 than mean3. It seems that mean3 is always the less precise one. Intuitively this makes sense: computing mean3 requires 3 operations, so there are more chances to lose precision, while computing mean1 requires only 2.
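
A quick way to spot-check this claim (assuming long double is wider than double, which holds for the 80-bit x87 type but is not guaranteed on every platform) is to compare both formulas against a higher-precision reference over random pairs:

#include <iostream>
#include <random>
#include <cmath>

int main() {
    std::mt19937_64 gen(42);
    std::uniform_real_distribution<double> dist(0.0, 100.0);
    long long wins1 = 0, wins3 = 0, ties = 0;
    for (int i = 0; i < 1000000; ++i) {
        double a = dist(gen), b = dist(gen);
        // Reference mean computed in (presumably) higher precision.
        long double exact = (static_cast<long double>(a) + b) / 2.0L;
        long double err1 = std::fabs((a + b) / 2.0 - exact);
        long double err3 = std::fabs(a + (b - a) / 2.0 - exact);
        if (err1 < err3) ++wins1;
        else if (err3 < err1) ++wins3;
        else ++ties;
    }
    std::cout << "mean1 closer: " << wins1
              << ", mean3 closer: " << wins3
              << ", ties: " << ties << '\n';
}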


Update: 1.2 and 3.6 are stored as

1.1999999999999999555910790149937383830547332763671875
3.600000000000000088817841970012523233890533447265625

So their true mean is

2.40000000000000002220446049250313080847263336181640625

However, this value cannot be represented in doubles. The closest double value happens to be mean1; the values of mean1 and mean3 are

2.399999999999999911182158029987476766109466552734375 (difference 1.11e-16)
2.4000000000000003552713678800500929355621337890625   (difference 3.33e-16)
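
These expansions can be printed directly. Values of this magnitude need at most 55 fractional digits, so fixed output with setprecision(55) shows the exact stored value (this assumes the standard library prints correctly rounded decimal digits, which glibc/libstdc++ do):

#include <iostream>
#include <iomanip>

using namespace std;

int main() {
    double a = 1.2;
    double b = 3.6;
    // 55 fractional digits are enough here, so the exact stored values
    // are printed, padded with trailing zeros.
    cout << fixed << setprecision(55);
    cout << "a:     " << a << endl;
    cout << "b:     " << b << endl;
    cout << "mean1: " << (a + b) / 2.0 << endl;
    cout << "mean3: " << a + (b - a) / 2.0 << endl;
}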

I stumbled upon the third case in the Fortran code at https://www.cs.umd.edu/~oleary/LBFGS/FORTRAN/linesearch.f

stpf = stpc + (stpq - stpc)/2
stp = stx + p5*(sty - stx)

where p5 is 0.5

I'm not sure what the motivation was for this way of calculating the mean. Probably it is the n = 2 case of a running average, as discussed in numerically stable running average of floating point values or in https://diego.assencio.com/?index=c34d06f4f4de2375658ed41f70177d59, but that improves accuracy only for a large number of elements.
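
For reference, the running (incremental) mean from those links updates the estimate as m_k = m_{k-1} + (x_k - m_{k-1}) / k; for two elements this reduces exactly to a + (b - a) / 2, i.e. the third form. A minimal sketch (the name running_mean is mine):

#include <iostream>
#include <vector>

// Incremental mean: m_k = m_{k-1} + (x_k - m_{k-1}) / k.
// For two elements this is x1 + (x2 - x1) / 2, i.e. the third form.
double running_mean(const std::vector<double>& xs) {
    double m = 0.0;
    int k = 0;
    for (double x : xs) {
        ++k;                  // number of elements seen so far
        m += (x - m) / k;
    }
    return m;
}

int main() {
    std::cout.precision(20);
    std::cout << running_mean({1.2, 3.6}) << '\n';  // same value as mean3
}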

If this is an attempt to avoid overflow, it doesn't seem to work. All three cases can produce either overflow or underflow.

#include <iostream>
#include <iomanip>

using namespace std;

double mean1(double a, double b) {
    return (a + b) / 2.0;
}

double mean2(double a, double b) {
    return a / 2.0 + b / 2.0;
}

double mean3(double a, double b) {
    return a + (b - a) / 2.0;
}

void test(double a, double b) {
    cout << setprecision(20);
    cout << "a: " << a << ", b: " << b << endl;
    cout << "mean1: " << mean1(a, b) << endl;
    cout << "mean2: " << mean2(a, b) << endl;
    cout << "mean3: " << mean3(a, b) << endl;
    cout << endl;
}

int main() {
    test(1e+308, 1e+308); // case 1 overflow
    test(4e-324, 4e-324); // case 2 underflow
    test(-1e+308, 1e+308); // case 3 overflow
}
which produces

a: 1.000000000000000011e+308, b: 1.000000000000000011e+308
mean1: inf
mean2: 1.000000000000000011e+308
mean3: 1.000000000000000011e+308

a: 4.9406564584124654418e-324, b: 4.9406564584124654418e-324
mean1: 4.9406564584124654418e-324
mean2: 0
mean3: 4.9406564584124654418e-324

a: -1.000000000000000011e+308, b: 1.000000000000000011e+308
mean1: 0
mean2: 0
mean3: inf

Furthermore, performance is a tiny bit worse for the third case (these results should be taken with caution, as they can vary greatly depending on context, but generally code with fewer instructions runs faster):

#include <benchmark/benchmark.h>

static void case_1(benchmark::State& state) {
  volatile double a = 1.2;
  volatile double b = 3.6;
  for (auto _ : state) {
    benchmark::DoNotOptimize((a + b) / 2.0);
  }
}
BENCHMARK(case_1);

static void case_2(benchmark::State& state) {
  volatile double a = 1.2;
  volatile double b = 3.6;
  for (auto _ : state) {
    benchmark::DoNotOptimize(a / 2.0 + b / 2.0);
  }
}
BENCHMARK(case_2);

static void case_3(benchmark::State& state) {
  volatile double a = 1.2;
  volatile double b = 3.6;
  for (auto _ : state) {
    benchmark::DoNotOptimize(a + (b - a) / 2.0);
  }
}
BENCHMARK(case_3);

https://quick-bench.com/q/V_STrIZ5CmIL0Q3KpbLX6G9-G90

Probably there is some logic behind that choice that I'm missing. Perhaps the code above was meant to run on different hardware that didn't support IEEE 754? So, is there any reason to compute the mean as a + (b - a) / 2 instead of (a + b) / 2? Or is it just a slight inaccuracy, and the straightforward way should be used to compute the mean?


Solution

  • For accuracy, (a + b) / 2 or a / 2 + b / 2 should be preferred, because division by 2 is lossless in floating point unless it "underflows". The / 2 just decrements the exponent, so, really, errors are only introduced in the addition.

    You can * 0.5 instead of / 2.0 if you like -- it makes no difference.

    The form a + (b - a) / 2, however, guarantees that:

    • If both a and b are nonnegative finite values, then so is the result. The (a + b)/2 form does not provide this guarantee; and
    • If both a and b are nonnegative finite values, and b >= a, then the result is always >= a and <= b. Neither of the other two forms provides that guarantee. This can be important, for example, if you're doing a binary search.
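
As a concrete illustration of that last point, here is a minimal bisection sketch (the bisect helper and the example function are mine, not taken from the linked Fortran code). It uses a + (b - a) / 2 as the midpoint and asserts that the midpoint never leaves the bracket; the first line of main also checks that dividing by 2 and multiplying by 0.5 agree:

#include <cassert>
#include <iostream>

// Bisection on [lo, hi] with 0 <= lo <= hi, both finite, and f(lo) < 0 < f(hi).
// With mid = lo + (hi - lo) / 2 the midpoint stays inside [lo, hi], so the
// bracket never escapes; the loop stops once the interval can no longer shrink.
double bisect(double lo, double hi, double (*f)(double)) {
    while (true) {
        double mid = lo + (hi - lo) / 2.0;
        assert(lo <= mid && mid <= hi);   // the guarantee discussed above
        if (mid == lo || mid == hi)       // interval can no longer shrink
            return mid;
        if (f(mid) < 0.0)
            lo = mid;
        else
            hi = mid;
    }
}

int main() {
    double x = 1.2;
    std::cout << std::boolalpha << (x / 2.0 == x * 0.5) << '\n';  // true

    // Find sqrt(2) as the root of x^2 - 2 on [1, 2].
    double r = bisect(1.0, 2.0, [](double v) { return v * v - 2.0; });
    std::cout.precision(20);
    std::cout << r << '\n';
}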