Tags: python, algorithm, numpy, floating-point, sum

With pairwise summation, how many terms do I need to get an appreciably wrong result?


Using a given species of floating-point numbers, say float16, it is straightforward to construct sums with totally wrong results. For example, using Python/numpy:

import numpy as np

one = np.float16(1)
ope = np.nextafter(one,one+one)

np.array((ope,one,-one,-one)).cumsum()
# array([1.001, 2.   , 1.   , 0.   ], dtype=float16)

Here we have used cumsum to force naive left-to-right summation. Left to its own devices, numpy would have used a different order of summation, yielding a better answer:

np.array((ope,one,-one,-one)).sum()
# 0.000977

The above is based on cancellation. To rule out this class of examples, let us only allow nonnegative terms. For naive summation it is still easy to give examples with very wrong sums. The following sums 10^4 identical terms, each equal to 10^-4:

np.full(10**4,10**-4,np.float16).cumsum()
# array([1.0e-04, 2.0e-04, 3.0e-04, ..., 2.5e-01, 2.5e-01, 2.5e-01],
  dtype=float16)

The last term is off by a factor of 4.
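
The factor of 4 has a simple explanation: the running sum stagnates once it reaches 0.25, where a single term of 10^-4 is smaller than half a float16 ulp and rounds away. A quick check of that claim:

```python
import numpy as np

x = np.float16(0.25)
t = np.float16(1e-4)           # one term of the sum, ~1.0002e-4 in float16
half_ulp = np.spacing(x) / 2   # half the float16 ulp at 0.25, i.e. 2**-13

print(float(t) < float(half_ulp))  # True: the term is below half an ulp
print(np.float16(x + t) == x)      # True: the running sum is stuck at 0.25
```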

Again, allowing numpy to use pairwise summation gives a much better result:

np.full(10**4,10**-4,np.float16).sum()
# 1.0

It is possible to construct sums that beat pairwise summation. Choosing eps below the resolution at 1 (so that 1 + eps rounds back to 1), we can use 1, eps, 0, eps, 3x0, eps, 7x0, eps, 15x0, eps, ..., but this requires an insane number of terms.
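
To see the mechanism at small scale, here is a sketch with a hand-rolled recursive pairwise sum (numpy's real implementation switches to a blocked naive loop for short runs, which would hide the effect) at depth 4: each eps ends up alone in a subtree of zeros, survives to a top-level partial sum, and is then absorbed by the subtotal containing 1.

```python
import numpy as np

def pairwise(x):
    # Textbook recursive pairwise summation in float16, with no
    # base-case blocking (unlike numpy's actual implementation).
    if len(x) == 1:
        return x[0]
    mid = len(x) // 2
    return np.float16(pairwise(x[:mid]) + pairwise(x[mid:]))

eps = np.float16(2.0 ** -11)   # 1 + eps rounds back to 1 in float16
d = 4
terms = [np.float16(1)]
for k in range(d):             # builds 1, eps, 0, eps, 3x0, eps, 7x0, eps
    terms += [np.float16(0)] * (2 ** k - 1) + [eps]
x = np.array(terms, dtype=np.float16)   # 2**d = 16 terms

print(float(pairwise(x)))                  # 1.0: all four eps are lost
print(float(np.sum(x, dtype=np.float64)))  # true sum 1 + 4*eps = 1.001953125
```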

My question: Using float16 and only nonnegative terms, how many terms are required for pairwise summation to produce a result that is off by at least a factor of 2?

Bonus: Same question with "positive" instead of "nonnegative". Is it even possible?


Solution

  • Depth 1432 (so 2^1432 terms) suffices for the true sum to exceed the computed sum by a factor of two.

    I had an idea for how to determine, to within a factor of two, the number of terms needed.

    We use dynamic programming to answer the following question: given a depth d and a target floating-point sum s, what is the largest true sum of 2^d nonnegative float16s with pairwise sum s?

    Let that quantity be T(d, s). We get a recurrence

    T(0, s) = s                                                  for all s,
    T(d, s) = max { T(d-1, a) + T(d-1, b) : float16(a + b) = s } for all d >= 1 and all s.
    

    Each step of the recurrence will involve looping over about 2^29 combinations (since we can assume a ≤ b, and negative floats and special values are off limits), and the depth required won't exceed 10^4 or so, judging by Hans's answer and yours. Seems feasible to me.
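
To make the recurrence concrete at a size that runs instantly, here is the same DP in Python on a made-up toy minifloat with 3 exponent bits and 2 fraction bits (parameters chosen purely for illustration; the depth it prints applies to the toy format, not to float16):

```python
# Toy-scale DP for T(d, s): the largest true sum of 2**d nonnegative
# minifloats whose pairwise rounded sum is s.  Values are bit patterns;
# "fixed" values are exact integer multiples of the smallest subnormal.
E, F = 3, 2                    # toy parameters, not float16's 5 and 10
K_INF = ((1 << E) - 1) << F    # first non-finite bit pattern

def fixed(a):
    e = a >> F
    if e == 0:
        return a               # subnormal: bit pattern is the fixed value
    return ((1 << F) + (a & ((1 << F) - 1))) << (e - 1)

def plus(a, b):
    # Correctly rounded (round half to even) minifloat addition on bit
    # patterns; returns None on overflow.  Assumes a + b > 0.
    exact = fixed(a) + fixed(b)
    e = exact.bit_length() - F
    if e <= 0:
        return exact
    ulp = 1 << (e - 1)
    r = exact & (ulp - 1)
    rounded = exact - r
    if 2 * r > ulp or (2 * r == ulp and rounded & ulp):
        rounded += ulp
    e = rounded.bit_length() - F
    if e >= (1 << E) - 1:
        return None
    return (e << F) + ((rounded >> (e - 1)) - (1 << F))

greatest = [fixed(a) for a in range(K_INF)]
for depth in range(1, 200):
    nxt = list(greatest)       # zero padding gives T(d, s) >= T(d-1, s)
    for a in range(1, K_INF):
        for b in range(a, K_INF):
            c = plus(a, b)
            if c is not None:
                nxt[c] = max(nxt[c], greatest[a] + greatest[b])
    greatest = nxt
    ratio = max(greatest[a] / fixed(a) for a in range(1, K_INF))
    if ratio >= 2:
        print("toy depth", depth, "ratio", ratio)
        break
```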

    DP code:

    #include <algorithm>
    #include <cstdio>
    #include <vector>
    
    // A Float16 is handled as its raw bit pattern; a Fixed is an exact
    // nonnegative integer count of the smallest subnormal, 2^-24.
    using Float16 = int;
    using Fixed = unsigned long long;
    
    static constexpr int kExponentBits = 5;
    static constexpr int kFractionBits = 10;
    static constexpr Float16 kInfinity = ((1 << kExponentBits) - 1)
                                         << kFractionBits;
    
    // Map a float16 bit pattern to its exact fixed-point value.
    Fixed FixedFromFloat16(Float16 a) {
      int exponent = a >> kFractionBits;
      if (exponent == 0) {
        return a;
      }
      Float16 fraction = a - (exponent << kFractionBits);
      Float16 significand = (1 << kFractionBits) + fraction;
      return static_cast<Fixed>(significand) << (exponent - 1);
    }
    
    // Correctly rounded (round half to even) float16 addition on bit
    // patterns; returns false when the sum overflows to infinity.
    // Requires a + b > 0 (clz is undefined for 0).
    bool Plus(Float16 a, Float16 b, Float16* c) {
      Fixed exact_sum = FixedFromFloat16(a) + FixedFromFloat16(b);
      int exponent = 64 - kFractionBits - __builtin_clzll(exact_sum);
      if (exponent <= 0) {
        *c = static_cast<Float16>(exact_sum);
        return true;
      }
      Fixed ulp = Fixed{1} << (exponent - 1);
      Fixed remainder = exact_sum & (ulp - 1);
      Fixed rounded_sum = exact_sum - remainder;
      if (2 * remainder > ulp ||
          (2 * remainder == ulp && (rounded_sum & ulp) != 0)) {
        rounded_sum += ulp;
      }
      exponent = 64 - kFractionBits - __builtin_clzll(rounded_sum);
      if (exponent >= (1 << kExponentBits) - 1) {
        return false;
      }
      Float16 significand = rounded_sum >> (exponent - 1);
      Float16 fraction = significand - (Float16{1} << kFractionBits);
      *c = (exponent << kFractionBits) + fraction;
      return true;
    }
    
    int main() {
      std::vector<Fixed> greatest0(kInfinity);
      for (Float16 a = 0; a < kInfinity; a++) {
        greatest0[a] = FixedFromFloat16(a);
      }
      for (int depth = 1; true; depth++) {
        auto greatest1 = greatest0;  // zero padding: T(d, s) >= T(d-1, s)
        for (Float16 a = 1; a < kInfinity; a++) {
          Fixed greatest0_a = greatest0[a];
          for (Float16 b = a; b < kInfinity; b++) {
            Float16 c;
            if (!Plus(a, b, &c)) {
              continue;
            }
            Fixed& value = greatest1[c];
            value = std::max(value, greatest0_a + greatest0[b]);
          }
        }
    
        std::vector<double> ratios;
        ratios.reserve(kInfinity - 1);
        for (Float16 a = 1; a < kInfinity; a++) {
          ratios.push_back(greatest1[a] / static_cast<double>(FixedFromFloat16(a)));
        }
        std::printf("depth %d, ratio = %.17g\n", depth,
                    *std::max_element(ratios.begin(), ratios.end()));
        greatest0.swap(greatest1);
      }
    }
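
As a quick sanity check on FixedFromFloat16 (my own cross-check, not part of the answer's pipeline): in Python, the fixed-point value of every finite nonnegative bit pattern should equal the numpy float16 value scaled by 2^24, the reciprocal of the smallest subnormal.

```python
import numpy as np

def fixed_from_float16(bits):
    # Python port of the C++ FixedFromFloat16: a float16 bit pattern
    # becomes an exact integer count of 2**-24, the smallest subnormal.
    exponent = bits >> 10
    if exponent == 0:
        return bits
    significand = (1 << 10) + (bits & 0x3FF)
    return significand << (exponent - 1)

bits = np.arange(31 << 10, dtype=np.uint16)        # all finite patterns
values = bits.view(np.float16).astype(np.float64)  # exact in float64
assert all(fixed_from_float16(int(b)) == int(v * 2 ** 24)
           for b, v in zip(bits, values))
print("FixedFromFloat16 matches numpy on all", len(bits), "finite values")
```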
    

    I'll run this and post an update when it's done.