
Unexpected value counts using awk


I have a text file called "test.txt" containing multiple lines, with the fields separated by semicolons. For each line, I'm trying to take the value of field 3, strip out everything but the number, and compare it to the value of field 3 on the previous line. If the value is unique (not a repeat of the previous one), the field 3 value and the difference between it and the previous value should be redirected to a file called "differences.txt".

So far, I have the following code:

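awk -F';' '
BEGIN { d = 0 }                                  # d: previous value of field 3
{
    gsub(/^.*=/, "", $3)                         # strip everything up to and including "="
    if (d > 0 && $3 - d > 0) print $3, $3 - d    # value and its increase over the previous one
    d = $3
}
' test.txt > differences.txt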

This works absolutely fine when I run it on the following text:

field1=xxx;field2=xxx;field3=111222222;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222222;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222333;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222444;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222555;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222555;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222777;field4=xxx;field5=xxx
field1=xxx;field2=xxx;field3=111222888;field4=xxx;field5=xxx

Output, as expected:

111222333 111
111222444 111
111222555 111
111222777 222
111222888 111

However, when I run it on the following text, I get completely different, unexpected numbers. I'm not sure if it's due to the increased length of the field or something else.

test:

test=none;test=20170606;test=1111111111111111111;
test=none;test=20170606;test=2222222222222222222;
test=none;test=20170606;test=3333333333333333333;
test=none;test=20170606;test=4444444444444444444;
test=none;test=20170606;test=5555555555555555555;
test=none;test=20170606;test=5555555555555555555;
test=none;test=20170606;test=6666666666666666666;
test=none;test=20170606;test=7777777777777777777;
test=none;test=20170606;test=8888888888888888888;
test=none;test=20170606;test=9999999999999999999;
test=none;test=20170606;test=100000000000000000000;
test=none;test=20170606;test=11111111111111111111;

Output, with unexpected values:

2222222222222222222 1111111111111111168
3333333333333333333 1111111111111111168
4444444444444444444 1111111111111111168
5555555555555555555 1111111111111110656
6666666666666666666 1111111111111111680
7777777777777777777 1111111111111110656
8888888888888888888 1111111111111111680
9999999999999999999 1111111111111110656
100000000000000000000 90000000000000000000

Can anyone see where I'm going wrong, as I'm obviously missing something... and it's driving me mental!!

Many thanks! :)


Solution

  • The numbers in the second example input are too large to be represented exactly. The logic of the program is correct, but awk does its arithmetic in double-precision floating point, so computations on very large integers lose precision: 2222222222222222222 - 1111111111111111111 yields 1111111111111111168 instead of the expected 1111111111111111111.

    See a detailed explanation in The GNU Awk User’s Guide:

    As has been mentioned already, awk uses hardware double precision with 64-bit IEEE binary floating-point representation for numbers on most systems. A large integer like 9,007,199,254,740,997 has a binary representation that, although finite, is more than 53 bits long; it must also be rounded to 53 bits. The biggest integer that can be stored in a C double is usually the same as the largest possible value of a double. If your system double is an IEEE 64-bit double, this largest possible value is an integer and can be represented precisely. What more should one know about integers?

    If you want to know what is the largest integer, such that it and all smaller integers can be stored in 64-bit doubles without losing precision, then the answer is 2^53. The next representable number is the even number 2^53 + 2, meaning it is unlikely that you will be able to make gawk print 2^53 + 1 in integer format. The range of integers exactly representable by a 64-bit double is [-2^53, 2^53]. If you ever see an integer outside this range in awk using 64-bit doubles, you have reason to be very suspicious about the accuracy of the output.

    As @EdMorton pointed out in a comment, you can have arbitrary-precision arithmetic if your awk is GNU Awk (gawk) compiled with MPFR support and you specify the -M flag. For more details, see 15.3 Arbitrary-Precision Arithmetic Features.
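
To make the difference concrete, here is a minimal sketch that runs the script from the question twice on two of the sample lines: once with the default double-precision arithmetic and once with gawk -M. The temporary file, the variable names (tmp, prog, limit_check, double_result, bignum_result) and the 2^53 probe are illustrative assumptions, not part of the original question or answer, and the -M run is skipped when no gawk with MPFR support is installed.

```shell
# Integers above 2^53 can no longer all be told apart in double precision.
limit_check=$(awk 'BEGIN { printf "%.0f %.0f\n", 2^53, 2^53 + 1 }')
echo "2^53 and 2^53 + 1 as doubles: $limit_check"

# Two sample lines from the question, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' \
  'test=none;test=20170606;test=1111111111111111111;' \
  'test=none;test=20170606;test=2222222222222222222;' > "$tmp"

# The script from the question, kept in a variable so it can be run twice.
prog='BEGIN{d=0} {gsub(/^.*=/,"",$3); if(d>0 && $3-d>0){print $3,$3-d} d=$3}'

# Default double-precision arithmetic: the difference is rounded.
double_result=$(awk -F';' "$prog" "$tmp")
echo "double precision: $double_result"

# Arbitrary precision, only if a gawk with working -M is installed.
# Assumption: the probe exits 0 only when 2^53 + 1 differs from 2^53,
# which is not the case with plain doubles.
bignum_result=
if command -v gawk >/dev/null 2>&1 &&
   gawk -M 'BEGIN { exit (2^53 + 1 == 2^53) }' 2>/dev/null; then
    bignum_result=$(gawk -M -F';' "$prog" "$tmp" 2>/dev/null)
    echo "gawk -M:          $bignum_result"
else
    echo "gawk -M not available; skipping the arbitrary-precision run"
fi

rm -f "$tmp"
```

On a system with 64-bit doubles, the first run should show a rounded difference like the 1111111111111111168 reported in the question (the exact formatting can vary between awk implementations), while the -M run, where available, should print the exact difference 1111111111111111111.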