Search code examples
c++calgorithmnumericalunderflow

Efficient way to compute geometric mean of many numbers


I need to compute the geometric mean of a large set of numbers, whose values are not a priori limited. The naive way would be

double geometric_mean(std::vector<double> const&data) // failure
{
  auto product = 1.0;
  for(auto x:data) product *= x;
  return std::pow(product,1.0/data.size());
}

However, this may well fail because of underflow or overflow in the accumulated product (note: long double doesn't really avoid this problem). So, the next option is to sum-up the logarithms:

double geometric_mean(std::vector<double> const&data)
{
  auto sumlog = 0.0;
  for(auto x:data) sum_log += std::log(x);
  return std::exp(sum_log/data.size());
}

This works, but calls std::log() for every element, which is potentially slow. Can I avoid that? For example by keeping track of (the equivalent of) the exponent and the mantissa of the accumulated product separately?


Solution

  • The "split exponent and mantissa" solution:

    double geometric_mean(std::vector<double> const & data)
    {
        double m = 1.0;
        long long ex = 0;
        double invN = 1.0 / data.size();
    
        for (double x : data)
        {
            int i;
            double f1 = std::frexp(x,&i);
            m*=f1;
            ex+=i;
        }
    
        return std::pow( std::numeric_limits<double>::radix,ex * invN) * std::pow(m,invN);
    }
    

    If you are concerned that ex might overflow you can define it as a double instead of a long long, and multiply by invN at every step, but you might lose a lot of precision with this approach.

    EDIT For large inputs, we can split the computation in several buckets:

    double geometric_mean(std::vector<double> const & data)
    {
        long long ex = 0;
        auto do_bucket = [&data,&ex](int first,int last) -> double
        {
            double ans = 1.0;
            for ( ;first != last;++first)
            {
                int i;
                ans *= std::frexp(data[first],&i);
                ex+=i;
            }
            return ans;
        };
    
        const int bucket_size = -std::log2( std::numeric_limits<double>::min() );
        std::size_t buckets = data.size() / bucket_size;
    
        double invN = 1.0 / data.size();
        double m = 1.0;
    
        for (std::size_t i = 0;i < buckets;++i)
            m *= std::pow( do_bucket(i * bucket_size,(i+1) * bucket_size),invN );
    
        m*= std::pow( do_bucket( buckets * bucket_size, data.size() ),invN );
    
        return std::pow( std::numeric_limits<double>::radix,ex * invN ) * m;
    }