Search code examples
awkrhelnawk

Fetching a specific range of numeric value from huge list of numbers using AWK


i want to take out certain range of huge numeric values, 720000002774991000 to 720000002774991099. so i tried the below command,

awk -F, ' { if (($1 >= 720000002774991000) && ($1 <= 720000002774991099)) print $0} ' VOUCHER_DUMP_REPORT.csv | head

VOUCHER_DUMP_REPORT.csv is my input file and has only one column of that huge numbers.

but the output i am getting is not accurate, it has some values other than the range i gave.

output :

720000002774991065
720000002774991082
720000002774990985
720000002774991131
720000002774990919
720000002774991110
720000002774990947
720000002774991070
720000002774991042
720000002774991044

Solution

  • Looks like your numbers are too long to be correctly represented as integers.

    You have a couple of possible solutions. With GNU awk, you can use -M to enable support for arbitrary precision integers:

    awk -M '$1 >= 720000002774991000 && $1 <= 720000002774991099' file
    

    Otherwise, if you're sure that the first column only contains numbers, you can use a string comparison:

    awk -F, -v min=720000002774991000 -v max=720000002774991099 '
      BEGIN { lmin = length(min); lmax = length(max) } # save length of min and max
      "" $1 < min || "" $1 > max { next }            # skip lines which fail string comparison
      { l1 = length($1) }                            # calculate length of field
      l1 >= lmin && l1 <= lmax                       # check that string length is correct
    ' file
    

    "" $1 concatenates an empty string with the contents of the first field, which forces awk to treat it as a string rather than a number. Without this, the comparison would be numerical, rather than lexical, and you would have the same problem as in your original attempt.

    A probably less efficient but possibly easier to understand version that uses string comparison would be:

    awk -F, -v min=720000002774991000 -v max=720000002774991099 '
      "" $1 >= min && "" $1 <= max \
      && length($1) >= length(min) && length($1) <= length(max)' file
    

    As in the previous version, lines are printed that pass both the string comparison and the length comparison. The downside to this approach is that the length of min, max and $1 are calculated more times than necessary.


    Testing (all three of the above approaches)

    $ cat file
    720000002774991065
    720000002774991082
    720000002774990985
    720000002774991131
    720000002774990919
    720000002774991110
    720000002774990947
    720000002774991070
    720000002774991042
    720000002774991044
    $ awk -M '$1 >= 720000002774991000 && $1 <= 720000002774991099' file
    720000002774991065
    720000002774991082
    720000002774991070
    720000002774991042
    720000002774991044
    $ awk -F, -v min=720000002774991000 -v max=720000002774991099 '
      BEGIN { lmin = length(min); lmax = length(max) } # save length of min and max
      "" $1 < min || "" $1 > max { next }            # skip lines which fail string comparison
      { l1 = length($1) }                            # calculate length of field
      l1 >= lmin && l1 <= lmax                       # check that string length is correct
    ' file
    720000002774991065
    720000002774991082
    720000002774991070
    720000002774991042
    720000002774991044
    $ awk -F, -v min=720000002774991000 -v max=720000002774991099 '
      "" $1 >= min && "" $1 <= max \
      && length($1) >= length(min) && length($1) <= length(max)' file
    720000002774991065
    720000002774991082
    720000002774991070
    720000002774991042
    720000002774991044