c++ c++20 precision numeric numerical-methods

How can I minimize the numerical error in this summation?


I'm suffering from numerical error in the following code example. (Below the naive version I've added a Kahan summation attempt and a "more clever, but still naive" variant; unfortunately, they are even worse.)

#include <algorithm>
#include <cmath>    // std::abs for double
#include <iostream>
#include <random>
#include <vector>


int main()
{
    std::random_device rd;
    std::mt19937 g{ rd() };
    std::uniform_real_distribution<> u;

    static std::size_t constexpr n = 1000;

    std::vector<double> q(n);
    std::generate_n(q.begin(), q.size(), [&]() { return u(g); });

    double average_of_q{};
    for (auto const& x : q)   // x, so the loop variable does not shadow the vector q
        average_of_q += x;
    average_of_q /= n;

    std::vector<double> f(n);
    std::generate_n(f.begin(), n, [&]() { return u(g); });

    double sum1{};
    for (std::size_t i = 0; i < n; ++i)
        sum1 += std::abs(f[i] - q[i]);
    sum1 /= n;

    {
        double sum2{};
        for (std::size_t i = 0; i < n; ++i)
            sum2 += std::abs(f[i] - q[i]) - q[i];
        sum2 = sum2 / n + average_of_q;

        std::cout << "naive: " << std::abs(sum1 - sum2) << std::endl;
    }
    {
        double sum2{},
            c{};
        for (std::size_t i = 0; i < n; ++i)
        {
            double const x = std::abs(f[i] - q[i]) - q[i] - c,
                s = sum2 + x;
            c = (s - sum2) - x;

            sum2 = s;
        }
        sum2 = sum2 / n + average_of_q;

        std::cout << "kahan: " << std::abs(sum1 - sum2) << std::endl;
    }
    {
        double sum2{};
        for (std::size_t i = 0; i < n; ++i)
        {
            if (f[i] - q[i] >= 0)
                sum2 += f[i] - 2 * q[i];
            else
                sum2 -= f[i];
        }
        sum2 = sum2 / n + average_of_q;

        std::cout << "more clever, but still naive: " << std::abs(sum1 - sum2) << std::endl;
    }

    return 0;
}

The output is 1.11022e-16, whereas in exact arithmetic we would expect 0. How can I optimize this code so that std::abs(sum1 - sum2) is as small as possible?

To motivate this: In my actual application, I already know average_of_q and I don't need to iterate over every i, since I know that std::abs(f[i] - q[i]) is extremely small for most of the i, which is why I want to use the formula for sum2.

EDIT: I've asked for the theoretic part of this question on MSE as well (but it's slightly different; I didn't want to make things too complicated here): https://math.stackexchange.com/q/4688917/47771.

EDIT 2: I've also tried to "boost" the terms of the sum by multiplying them by a constant factor:

{
    double sum2{};
    for (std::size_t i = 0; i < n; ++i)
        sum2 += 1000 * (std::abs(f[i] - q[i]) - q[i]);
    sum2 = sum2 / (1000 * n) + average_of_q;

    std::cout << "boosted: " << std::abs(sum1 - sum2) << std::endl;
}

EDIT 3:

This might be useful information: In my actual application, many of the f[i] are small compared to q[i]. For simplicity, you can assume that all q[i] = 1, many of the f[i] are around 1e-10, but a few are close to 1.


Solution

  • This scenario boils down to comparing the accumulated round-off error of several mathematically equivalent ways of computing the same quantity. As was pointed out in the comments, to minimize the numerical difference under finite-precision floating-point computation, the intermediate computation must be carried out in higher than target precision. In this code, that applies to the computation of average_of_q, sum1, and sum2.

    The target precision here is double, which very likely is mapped to the IEEE-754 binary64 binary floating-point format. Various compilers offer some form of quadruple precision floating-point type, which may or may not be mapped to IEEE-754's binary128. For example, with the Intel compiler (icl), the type _Quad is offered, and works just fine for this code. However, a more portable solution could use Kahan summation to accumulate in quasi-quadruple precision. This is demonstrated below.
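Before the full program, here is a minimal sketch of why compensated summation helps. The helper names naive_sum, kahan_sum, and one_plus_tiny_terms are hypothetical and chosen only for this illustration; the test data (one term equal to 1 followed by a million terms of 1e-16) is likewise an assumption, loosely modeled on the "one large value, many tiny ones" situation described in the question.

```cpp
#include <vector>

// Hypothetical helper: plain left-to-right accumulation in double.
double naive_sum(std::vector<double> const& v)
{
    double sum = 0;
    for (double x : v)
        sum += x;
    return sum;
}

// Hypothetical helper: Kahan (compensated) summation. The variable c
// carries the low-order bits that were lost when the previous term was added.
double kahan_sum(std::vector<double> const& v)
{
    double sum = 0, c = 0;
    for (double x : v) {
        double const y = x - c;    // apply the correction from the previous step
        double const t = sum + y;  // low-order bits of y may be lost here
        c = (t - sum) - y;         // recover (with opposite sign) what was lost
        sum = t;
    }
    return sum;
}

// Hypothetical test data: 1.0 followed by one million terms of 1e-16.
std::vector<double> one_plus_tiny_terms()
{
    std::vector<double> v(1000001, 1e-16);
    v[0] = 1.0;
    return v;
}
```

Each 1e-16 is smaller than half the spacing of doubles near 1.0 (2^-53, about 1.11e-16), so naive_sum(one_plus_tiny_terms()) stays at exactly 1.0, while kahan_sum(one_plus_tiny_terms()) should land within a few units in the last place of 1 + 1e-10. This only holds if the compiler does not re-associate the floating-point expressions.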

    From a software perspective it is important to instruct the compiler not to re-associate floating-point expressions, so as to preserve the numerical properties of Kahan summation. The command-line flags that enforce this differ by compiler. In my case the flag is /fp:strict, and the entire compiler invocation for the code below was icl /W4 /Ox /QxHOST /fp:strict array_sum_issue.cpp
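To illustrate why re-association matters, the following sketch shows that floating-point addition is not associative, and that the Kahan compensation term is nonzero even though it is algebraically zero. The helper names are hypothetical and exist only for this illustration.

```cpp
#include <cmath>

// Hypothetical helpers: the same three terms, added in two different orders.
double sum_left_to_right(double a, double b, double c) { return (a + b) + c; }
double sum_right_to_left(double a, double b, double c) { return a + (b + c); }

// Hypothetical helper: the compensation term of a single Kahan step.
// Algebraically (sum + y - sum) - y is 0; in floating point it is the
// (negated) rounding error committed by sum + y.
double kahan_compensation(double sum, double y)
{
    double const t = sum + y;
    return (t - sum) - y;
}
```

With e = std::ldexp(1.0, -53), sum_left_to_right(1.0, e, e) yields exactly 1.0, because each addition of e is rounded away, whereas sum_right_to_left(1.0, e, e) yields 1 + 2^-52. Likewise kahan_compensation(1.0, e) is -e rather than 0. A compiler that is allowed to treat these expressions as real-number algebra may simplify the compensation term to zero, which silently turns Kahan summation back into naive summation. With GCC or Clang, for instance, -ffast-math permits this kind of re-association and should therefore be avoided here.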

    With n=1000000 the difference |sum1 - sum2| is usually 0, but occasionally 2^-54, i.e. one unit in the last place (ulp) for results of this magnitude, which is on the order of double-precision epsilon.

    #include <algorithm>
    #include <cmath>    // std::abs for double, std::fma
    #include <cstdio>
    #include <cstdlib>
    #include <random>
    #include <vector>
    
    int main()
    {
        std::random_device rd;
        std::mt19937 g{ rd() };
        std::uniform_real_distribution<> u;
    
        static std::size_t constexpr n = 1000000;
    
        std::vector<double> q(n);
        std::generate_n(q.begin(), q.size(), [&]() { return u(g); });
    
        double average_of_q{};
        {
            double sum = 0, c = 0;
            for (std::size_t i = 0; i < n; ++i) {
                double y = q[i] - c;
                double t = sum + y;
                c = (t - sum) - y;
                sum = t;
            }
            average_of_q = sum / n;
        }
    
        std::vector<double> f(n);
        std::generate_n(f.begin(), n, [&]() { return u(g); });
    
        double sum1{};
        {
            double sum = 0, c = 0;
            for (std::size_t i = 0; i < n; ++i) {
                double y = std::abs(f[i] - q[i]) - c;
                double t = sum + y;
                c = (t - sum) - y;
                sum = t;
            }
            sum1 = sum / n;
        }
    
        double sum2{};
        {
            double sum = 0, c = 0;
            for (std::size_t i = 0; i < n; ++i) {
                double y = (std::abs(f[i] - q[i]) - q[i]) - c;
                double t = sum + y;
                c = (t - sum) - y;
                sum = t;
            }
            sum2 = std::fma (sum, 1.0 / n, average_of_q);
        }
        double diff = std::abs (sum1 - sum2);
        std::printf ("average_of_q = % 23.16e (% 23.13a)\n", average_of_q, average_of_q);
        std::printf ("sum1         = % 23.16e (% 23.13a)\n", sum1, sum1);
        std::printf ("sum2         = % 23.16e (% 23.13a)\n", sum2, sum2);
        std::printf ("|sum1-sum2|  = % 23.16e (% 23.13a)\n", diff, diff);
    
        return EXIT_SUCCESS;
    }
    

    Sample output from the above program (numerical values will differ a bit based on the random numbers generated):

    average_of_q =  5.0031728235599426e-01 (   0x1.0029963aaf686p-1)
    sum1         =  3.3347262871877092e-01 (   0x1.5579d949d5452p-2)
    sum2         =  3.3347262871877092e-01 (   0x1.5579d949d5452p-2)
    |sum1-sum2|  =  0.0000000000000000e+00 (   0x0.0000000000000p+0)
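
The occasional nonzero difference mentioned earlier can be related to the sample values above: they have binary exponent -2 (the p-2 suffix in the hexadecimal output), and for such values adjacent doubles are 2^-54 apart. A small sketch to check this, assuming IEEE-754 binary64 for double; the helper name ulp_above is hypothetical:

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: distance from x to the next larger representable double.
double ulp_above(double x)
{
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}
```

For the sample result, ulp_above(0.33347262871877092) equals std::ldexp(1.0, -54), so a difference of 2^-54 between sum1 and sum2 means the two results are adjacent doubles. For comparison, ulp_above(1.0) equals std::numeric_limits<double>::epsilon(), i.e. 2^-52.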