I wrote a code to calculate the sum of the series 2^(-k), but I don't know how to improve the accuracy of this calculation. This is what I've done so far.
#include <iostream>
#include <math.h>
using namespace std;
int main()
{
int i, n;
float sum = 0;
cout << "Enter the value of n: ";
cin >> n;
for (i=1; i<=n; i++)
sum += 1.0/pow(2,i);
cout << "Sum: " << sum;
return 0;
}
Any suggestion and/or help is very appreciated.
To see more precise output, you need to request more precision than the C++ default. One way to do this is:
#include <iomanip>
…
std::cout << std::setprecision(99);
Next, consider this code:
for (i=1; i<=n; i++)
sum += 1.0/pow(2,i);
First, recognize that the quality of pow
implementations varies. The C and C++ standards are lax about the quality of floating-point operations, and some pow
implementations return results for simple cases like pow(10, 3)
that differ slightly from the mathematical result. Due to the ways pow
is frequently implemented, pow(2, i)
might not suffer from this problem, but it should be considered.
Let’s assume pow(2, i)
computes the proper result exactly. Let’s also assume that your C++ implementation uses the common IEEE-754 basic 32-bit binary floating-point format for float
. If so, there is no error in the sum computed above for n
≤ 24.
This is because each term, 1.0/pow(2, i)
, is representable as a single bit in the significand (fraction part) of a float
, and the float
has 24-bit significands, so 24 consecutive bits can be represented with no error. Once you increase the precision used to format output, the sums shown for n
≤ 24 should be exact.
When n
= 25, the sum no longer fits in a float
. At this point, the mathematical result will be rounded to the nearest value representable in a float
, generally using the rule that, if there is a tie between the two nearest representable values, the one with the even low bit will be chosen. This means the result will be 1, exactly. For all n
> 24, the result will be 1.
While using the float
type, it is not possible to increase the accuracy beyond this. That is because, of all the values that can be represented in the float
type, 1 is the value closest to the exact mathematical sum of the series. There simply is no closer representable value, so no computation or alteration of source code can produce any more accurate value.
You can produce more accurate values by using double
instead of float
. If the IEEE-754 basic 64-bit binary format is used for double
, then this will produce exact results for n
≤ 53. For n
> 53, the result will again be 1, and the sum could be improved only by using extended-precision arithmetic.
Additionally, note that:
float sum = 0;
for (i=1; i<=n; i++)
sum += 1.0/pow(2,i);
is mathematically equivalent to:
float sum = 1 - pow(2.f, (float) -n);