I'm suffering from a numerical error in the following code example (I've added my Kahan summation attempt and a more clever, but still naive, version below; unfortunately, it's even worse):
#include <algorithm>
#include <cmath>     // std::abs(double)
#include <cstddef>
#include <iostream>
#include <random>
#include <vector>

int main()
{
    std::random_device rd;
    std::mt19937 g{ rd() };
    std::uniform_real_distribution<> u;
    static std::size_t constexpr n = 1000;
    std::vector<double> q(n);
    std::generate_n(q.begin(), q.size(), [&]() { return u(g); });
    double average_of_q{};
    for (auto const& qi : q)
        average_of_q += qi;
    average_of_q /= n;
    std::vector<double> f(n);
    std::generate_n(f.begin(), n, [&]() { return u(g); });
    double sum1{};
    for (std::size_t i = 0; i < n; ++i)
        sum1 += std::abs(f[i] - q[i]);
    sum1 /= n;
    {
        double sum2{};
        for (std::size_t i = 0; i < n; ++i)
            sum2 += std::abs(f[i] - q[i]) - q[i];
        sum2 = sum2 / n + average_of_q;
        std::cout << "naive: " << std::abs(sum1 - sum2) << std::endl;
    }
    {
        double sum2{},
               c{};
        for (std::size_t i = 0; i < n; ++i)
        {
            double const x = std::abs(f[i] - q[i]) - q[i] - c,
                         s = sum2 + x;
            c = (s - sum2) - x;
            sum2 = s;
        }
        sum2 = sum2 / n + average_of_q;
        std::cout << "kahan: " << std::abs(sum1 - sum2) << std::endl;
    }
    {
        double sum2{};
        for (std::size_t i = 0; i < n; ++i)
        {
            if (f[i] - q[i] >= 0)
                sum2 += f[i] - 2 * q[i];
            else
                sum2 -= f[i];
        }
        sum2 = sum2 / n + average_of_q;
        std::cout << "more clever, but still naive: " << std::abs(sum1 - sum2) << std::endl;
    }
    return 0;
}
The output is 1.11022e-16, while theoretically we would expect it to be 0. How can I optimize this code so that std::abs(sum1 - sum2) is as small as possible?
To motivate this: In my actual application, I already know average_of_q, and I don't need to iterate over every i, since I know that std::abs(f[i] - q[i]) is extremely small for most i; this is why I want to use the formula for sum2.
EDIT: I've asked for the theoretic part of this question on MSE as well (but it's slightly different; I didn't want to make things too complicated here): https://math.stackexchange.com/q/4688917/47771.
EDIT 2: I've also tried to "boost" the terms of the sum by multiplying them by a constant factor:
    {
        double sum2{};
        for (std::size_t i = 0; i < n; ++i)
            sum2 += 1000 * (std::abs(f[i] - q[i]) - q[i]);
        sum2 = sum2 / (1000 * n) + average_of_q;
        std::cout << "boosted: " << std::abs(sum1 - sum2) << std::endl;
    }
It might be useful information: in my actual application, many of the f[i] are small compared to q[i]. For simplicity, you can assume that all q[i] = 1, that many of the f[i] are around 1e-10, and that a few are close to 1.
This scenario basically boils down to comparing the accumulated round-off error between multiple mathematically equivalent ways of computing the same quantity. As has been pointed out in the comments, in order to minimize the numerical difference under finite-precision floating-point computation, the intermediate computations must be carried out in higher than target precision. In this code, that applies to the computation of average_of_q, sum1, and sum2.
The target precision here is double, which very likely maps to the IEEE-754 binary64 binary floating-point format. Various compilers offer some form of quadruple-precision floating-point type, which may or may not map to IEEE-754's binary128. For example, the Intel compiler (icl) offers the type _Quad, which works just fine for this code. A more portable solution, however, can use Kahan summation to accumulate in quasi-quadruple precision. This is demonstrated below.
From a software perspective it is important to instruct the compiler not to re-associate floating-point expressions, so as to preserve the numerical properties of Kahan summation. The command-line flags enforcing this differ by compiler: in my case it is /fp:strict, and the entire compiler invocation for the code below was icl /W4 /Ox /QxHOST /fp:strict array_sum_issue.cpp. (MSVC likewise accepts /fp:strict, recent clang offers -ffp-model=strict, and gcc preserves evaluation order as long as -ffast-math / -funsafe-math-optimizations are not used.) With n = 1000000, the difference |sum1 - sum2| is usually 0, but occasionally 2⁻⁵⁴, which is one unit in the last place for results of this magnitude.
#include <algorithm>
#include <cmath>     // std::abs(double), std::fma
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <random>
#include <vector>

int main()
{
    std::random_device rd;
    std::mt19937 g{ rd() };
    std::uniform_real_distribution<> u;
    static std::size_t constexpr n = 1000000;
    std::vector<double> q(n);
    std::generate_n(q.begin(), q.size(), [&]() { return u(g); });
    double average_of_q{};
    {
        // Kahan summation: c carries the running compensation term
        double sum = 0, c = 0;
        for (std::size_t i = 0; i < n; ++i) {
            double y = q[i] - c;
            double t = sum + y;
            c = (t - sum) - y;   // recover the low-order bits lost in sum + y
            sum = t;
        }
        average_of_q = sum / n;
    }
    std::vector<double> f(n);
    std::generate_n(f.begin(), n, [&]() { return u(g); });
    double sum1{};
    {
        double sum = 0, c = 0;
        for (std::size_t i = 0; i < n; ++i) {
            double y = std::abs(f[i] - q[i]) - c;
            double t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }
        sum1 = sum / n;
    }
    double sum2{};
    {
        double sum = 0, c = 0;
        for (std::size_t i = 0; i < n; ++i) {
            double y = (std::abs(f[i] - q[i]) - q[i]) - c;
            double t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }
        // fused multiply-add: sum * (1.0 / n) + average_of_q with one rounding
        sum2 = std::fma (sum, 1.0 / n, average_of_q);
    }
    double diff = std::abs (sum1 - sum2);
    printf ("average_of_q = % 23.16e (% 23.13a)\n", average_of_q, average_of_q);
    printf ("sum1         = % 23.16e (% 23.13a)\n", sum1, sum1);
    printf ("sum2         = % 23.16e (% 23.13a)\n", sum2, sum2);
    printf ("|sum1-sum2|  = % 23.16e (% 23.13a)\n", diff, diff);
    return EXIT_SUCCESS;
}
Sample output from the above program (numerical values will differ a bit based on the random numbers generated):
average_of_q = 5.0031728235599426e-01 ( 0x1.0029963aaf686p-1)
sum1         = 3.3347262871877092e-01 ( 0x1.5579d949d5452p-2)
sum2         = 3.3347262871877092e-01 ( 0x1.5579d949d5452p-2)
|sum1-sum2|  = 0.0000000000000000e+00 ( 0x0.0000000000000p+0)
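As a structural aside, the three compensated loops above follow an identical pattern and could be factored into a small helper. A minimal sketch (the name kahan_sum and the index-to-term callback are illustrative, not part of the program above):

    // Kahan-compensated sum of term(0) + ... + term(n-1): c captures the
    // round-off of each addition and feeds it back into the next term.
    template <typename Fn>
    double kahan_sum (std::size_t n, Fn term)
    {
        double sum = 0, c = 0;
        for (std::size_t i = 0; i < n; ++i) {
            double y = term (i) - c;
            double t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }
        return sum;
    }

With this, e.g. sum1 = kahan_sum (n, [&](std::size_t i) { return std::abs (f[i] - q[i]); }) / n;. The same caveat applies: the helper only works as intended when the compiler is barred from re-associating floating-point expressions.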