awk and gawk with large integers and large powers of 2

It was my understanding that both POSIX awk and GNU awk use IEEE 754 doubles for both integers and floats. (I know the -M switch is available on GNU awk for arbitrary precision integers. This question assumes -M is not selected.)

This means that the max size of integer result with awk / gawk / perl (those without AUTOMATIC promotion to arbitrary precision integers) would be 53 bits since this is the max size integer that can fit in an IEEE 754 double. (At magnitudes greater than 2^53, you can no longer expect ±1 to work as it would with an integer, but floating point arithmetic still works within the limits of an IEEE double.)

It seems to be easily demonstrated.

These work as expected with correct results (to the last digit) on both awk and gawk:

$ gawk 'BEGIN{print 2**52-1}'
4503599627370495
$ gawk 'BEGIN{print 2**52+1}'
4503599627370497
$ gawk 'BEGIN{print 2**53-1}'
9007199254740991

This is off by 1 (and is what I would expect with 53 bit max integer):

$ gawk 'BEGIN{print 2**53+1}'      # 9007199254740993 is the correct result
9007199254740992

But here is what I would NOT expect. With certain powers of 2, both awk and GNU awk perform integer arithmetic at far greater precision than is possible within 53 bits.

(On my system, /usr/bin/awk is MacOS POSIX awk; gawk is GNU awk.)

Consider these examples, all precise to the digit:

$ gawk 'BEGIN{print 2**230}'  # float result with awk...
1725436586697640946858688965569256363112777243042596638790631055949824

$ /usr/bin/awk 'BEGIN{print 2**99}'   # max that POSIX awk supports
633825300114114700748351602688

The precision of ±1 is not supported at these magnitudes but limited arithmetic operations with powers of 2 are supported. Again, precise to the digit:

$ /usr/bin/awk 'BEGIN{print 2**99-2**98}'
316912650057057350374175801344    

$ /usr/bin/awk 'BEGIN{print 2**99+2**98}'
950737950171172051122527404032

$ gawk 'BEGIN{print 2**55-968}'  # 2^55=36028797018963968
36028797018963000

I am speculating that awk and gawk have some sort of nonstandard way of recognizing that 2^N is equivalent to 1<<N and doing some limited math inside of that arena.

Any form of [integer > 2] ^ Y with a result greater than 2^53 shows the drop in precision that I would expect, i.e., 10^15 is roughly the largest power of 10 for which ±1 is accurate, since 10^16 requires 54 bits.

$ gawk 'BEGIN{print 10**15+1}'  # correct
1000000000000001

$ gawk 'BEGIN{print 10**16+1}'  # not correct
10000000000000000

This is correct in magnitude for 10**64 but only precise for the first 16 digits (which I would expect):

$ gawk 'BEGIN{print 10**64}'
10000000000000001674705827425446886926697411428962669123675881472
# should be '1' + 64 '0'
# This is just a presentation issue of a value implying greater precision...

The GNU documentation is not exactly helpful, since it speaks of the max values for 64 bit unsigned and signed integers, implying those are used somehow. But it is easy to demonstrate that, with the exception of powers of 2, the max integer on gawk is 2**53.

Questions:

Am I correct that ALL integer calculations in awk / gawk are in fact IEEE doubles with max value of 2**53 for ±1? Is that documented somewhere?
If that is correct, what is happening with larger powers of 2?

(By the way, it would be nice if there were automatic switching to float format, the way Perl does, at the magnitude where precision is lost.)

Solution

I cannot speak to the numeric implementations used in particular versions of gawk or awk. This answer speaks to floating-point generally, particularly IEEE-754 binary formats.

Computing 2⁹⁹ for 2**99 and 2²³⁰ for 2**230 is simply a normal operation for floating-point arithmetic. Each is represented with a significand with one significant binary digit, 1, and an exponent of 99 or 230. Whatever routine is used to implement the exponentiation operation is presumably doing its job correctly. Since binary floating-point represents a number using a sign, a significand, and a scaling of two to some power, 2⁹⁹ and 2²³⁰ are easily represented.
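
One way to check that 2⁹⁹ really is stored as a bare power of two is to halve it repeatedly and see whether it lands exactly on 1. This is only a minimal sketch, assuming the awk in use computes with IEEE-754 doubles and accepts ^ for exponentiation; the variable names are illustrative.

```shell
# Halve 2^99 until it reaches 1. Dividing by 2 only lowers the exponent,
# so every step is exact when the starting value is an exact power of two.
halvings=$(awk 'BEGIN {
    x = 2^99
    n = 0
    while (x > 1) { x /= 2; n++ }
    # x == 1 holds only if there were no stray low-order bits
    printf "%d %d\n", n, (x == 1)
}')
echo "$halvings"
```

The first number printed is the count of halvings and the second is 1 when the loop ended exactly on 1.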

When these numbers are printed, some routine is called to convert them to decimal numerals. This routine also appears to be well implemented, producing correct output. Some work is required to do that conversion correctly, as implementing it with naïve arithmetic will introduce rounding errors that produce incorrect results. (Sometimes little engineering effort is given to conversion routines and they produce results accurate only to a limited number of significant decimal digits. This appears to be less common; correctly rounded implementations are more common now than they used to be.)
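
To look at the conversion step in isolation, the same double can be printed with two different formats. This is a sketch that assumes awk hands printf to a C library whose conversion is exact; printf "%.0f" is used rather than print so that the output does not depend on how a particular awk formats large values.

```shell
# One stored value, two decimal renderings.
full=$(awk 'BEGIN { printf "%.0f\n", 2^99 }')    # all decimal digits of the stored value
short=$(awk 'BEGIN { printf "%.17g\n", 2^99 }')  # 17 significant digits of the same value
echo "$full"
echo "$short"
```

The long form does not carry extra precision. It is just the exact decimal expansion of one binary value.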

Apparent “loss of precision,” more accurately called “loss of accuracy” or “rounding errors,” occurs when results cannot be exactly represented (such as 2⁵³+1) or when floating-point operations are implemented without correct rounding. For 2⁹⁹ and 2²³⁰, no such loss is imposed by the floating-point format.
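
The contrast between the two cases can be sketched directly in awk, assuming IEEE-754 doubles in the default round-to-nearest mode. Each comparison prints 1 when it holds.

```shell
# 2^53 + 1 needs 54 significant bits, so the sum is rounded back to 2^53.
rounded=$(awk 'BEGIN { print (2^53 + 1 == 2^53) }')
# 2^99 - 2^98 is exactly 2^98: one significant bit, nothing to round.
exact=$(awk 'BEGIN { print (2^99 - 2^98 == 2^98) }')
echo "2^53+1 collapses to 2^53: $rounded"
echo "2^99-2^98 equals 2^98:    $exact"
```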

“This means that the max size of integer result with awk / gawk / perl… would be 53 bits…”

This is incorrect, or at least incorrectly phrased. 2⁵³ is the largest integer N for which every integer from 0 through N can be represented in IEEE-754 64-bit binary. But it is certainly not the maximum. 2⁵³+2 can also be represented, having skipped 2⁵³+1. There are many more integers larger than 2⁵³ that can be represented.
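
Which integers survive above 2⁵³ follows from the spacing between adjacent doubles: 2 from 2⁵³ up to 2⁵⁴, 4 from 2⁵⁴ up to 2⁵⁵, 8 from 2⁵⁵ up to 2⁵⁶, and so on. The sketch below, assuming IEEE-754 doubles with round-to-nearest-even, lists 2⁵³+k for small k and then repeats the 2**55-968 case from the question, which is exact because 968 is a multiple of 4, the spacing just below 2⁵⁵.

```shell
# Integers just above 2^53: only the even ones are representable.
neighbours=$(awk 'BEGIN {
    for (k = 0; k <= 4; k++) printf "%.0f\n", 2^53 + k
}')
echo "$neighbours"

# Just below 2^55 the spacing is 4, and 968 = 4 * 242, so this result is exact.
below=$(awk 'BEGIN { printf "%.0f\n", 2^55 - 968 }')
echo "$below"
```

For k = 1 and k = 3 the true sum lies halfway between two doubles and is rounded to the neighbour with the even significand, which is why 2⁵³+1 comes out as 2⁵³ and 2⁵³+3 comes out as 2⁵³+4.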