Search code examples
c++numerical-methodsnumerical

Reduce rounding error of sum of square of floats


I try to calculate the sum of square of an array of float.

How to reduce the rounding error?

I am trying to sum about 5,000,000 floats in the inner loop of my actual program.

test.cpp:

#include <iostream>
#include <stdint.h>
template <typename Sum, typename Element>
Sum sum(const size_t st, const size_t en) {
    Sum s = 0;
    for (size_t i = st; i < en; ++ i) {
        s += Element(i)*Element(i); 
    }
    return s;
}
int main() {
    size_t size = 100000;
    std::cout << "double, float: " 
              << sum<double, float>(0,size) << "\n";
    std::cout << "int, int: " 
              << sum<int, int>(0,size) << "\n";
}

Output:

double, float: 3.33328e+14
int, int: 216474736

Solution

  • If the format of the floats is known, such as IEEE, then an array indexed by the exponent of a float can be used to store partial sums, then summed up to produce the total sum. During the array update, only floats with the same exponent are added together and stored into the array in the appropriate location. The final summation goes from smallest to largest. For C++, the array and functions could be members of a class.

    Example for floats where the array is passed as a parameter to the functions:

    /* clear array */
    void clearsum(float asum[256])
    {
    size_t i;
        for(i = 0; i < 256; i++)
            asum[i] = 0.f;
    }
    
    /* add a number into array */
    void addtosum(float f, float asum[256])
    {
    size_t i;
        while(1){
            /* i = exponent of f */
            i = ((size_t)((*(unsigned int *)&f)>>23))&0xff;
            if(i == 0xff){          /* max exponent, could be overflow */
                asum[i] += f;
                return;
            }
            if(asum[i] == 0.f){     /* if empty slot store f */
                asum[i] = f;
                return;
            }
            f += asum[i];           /* else add slot to f, clear slot */
            asum[i] = 0.f;          /* and continue until empty slot */
        }
    }
    
    /* return sum from array */
    float returnsum(float asum[256])
    {
    float sum = 0.f;
    size_t i;
        for(i = 0; i < 256; i++)
            sum += asum[i];
        return sum;
    }