
Why does the number of lines in a file reported by wc differ from the number of records read by awk?


When I count the number of lines in a file using wc:

cat ~/.account | wc -l

... the result is:

384

But when I use awk:

awk 'BEGIN {x = "1.02"; y = 0; } {x = x*2; y = y + 1} END {print x; print y}' ~/.account

... the result is:

8.03800926406447389928897056654e+115

385

Why is this?


Solution

  • What wc -l is doing

    From man wc:

    -l, --lines

    print the newline counts

    wc -l counts the newline characters in its input, whereas awk splits its input into records that are separated by newline characters.

    Consider this example:

    $ echo 1 | wc -l
    1
    $ echo -n 1 | wc -l
    0
    

    The input for the first command (echo 1) is the string "1\n". With -n, echo prints the 1 without a trailing newline, so the input is just the string "1". wc -l counts the newline characters in its input: there is one in the first case and none in the second.
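
    The same behaviour can be reproduced with printf, which is more portable than echo -n. This is only a small sketch: count_newlines is a hypothetical helper introduced here, not something from the question, and it trims the padding that some wc implementations print around the number.

    ```shell
    # Hypothetical helper: count newline characters on stdin and strip
    # any whitespace that wc prints around the number.
    count_newlines() {
        wc -l | tr -d '[:space:]'
    }

    printf 'a\nb\n' | count_newlines; echo   # two newline characters
    printf 'a\nb'   | count_newlines; echo   # one newline, last line unterminated
    printf ''       | count_newlines; echo   # empty input, no newlines
    ```

    The three commands print 2, 1 and 0: the unterminated "b" in the second input is not counted by wc -l.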

    What AWK is doing

    AWK divides its input into records, and each record into fields. This is an important part of the parsing magic that AWK does for us.

    From The GNU AWK User's Guide (but referring to standard AWK):

    Records are separated by a character called the record separator. By default, the record separator is the newline character. This is why records are, by default, single lines.

    Now compare what happens when the input does, and does not, end with this separator:

    $ echo 1 | awk 'END{print NR}'
    1
    $ echo -n 1 | awk 'END{print NR}'
    1
    

    (NR is a special variable for "the total number of input records read so far from all data files.")

    There is only one record in each case, even in the first ("1\n"), which contains a newline character. Since nothing follows the separator, it has nothing to separate. In other words, awk does not produce an empty record at the end when the input ends with the separator.
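
    To make the record rule concrete, here is a rough sketch with a few more inputs. count_records is a hypothetical helper name; it just wraps the awk 'END{print NR}' call used above.

    ```shell
    # Hypothetical helper: print how many records awk reads from stdin.
    count_records() {
        awk 'END { print NR }'
    }

    printf 'a\nb\n' | count_records   # two records, input ends with the separator
    printf 'a\nb'   | count_records   # still two records, "b" is unterminated
    printf 'a\n\n'  | count_records   # two records: "a" and an empty line
    ```

    Each command prints 2. An empty line in the middle of the input is a record of its own, but a separator at the very end does not open a new, empty record.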

    If your input file does not end with a newline character, wc -l will report one fewer than awk's number of records (NR), which would explain the 384 versus 385 in the question.
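
    One way to check whether this applies to a given file is to look at its last byte. The following is a minimal sketch under the assumption that the file is plain text; ends_with_newline and compare_counts are hypothetical helper names, and the temporary file stands in for ~/.account.

    ```shell
    # Hypothetical helper: succeed if the last byte of the file is a newline.
    # Command substitution strips a trailing newline, so the result is empty
    # exactly when that last byte is "\n" (an empty file also passes).
    ends_with_newline() {
        [ -z "$(tail -c 1 "$1")" ]
    }

    # Hypothetical helper: print the wc -l count and awk's NR for one file.
    compare_counts() {
        lines=$(wc -l < "$1" | tr -d '[:space:]')
        records=$(awk 'END { print NR }' "$1")
        echo "wc -l: $lines, awk NR: $records"
    }

    tmp=$(mktemp)
    printf 'a\nb' > "$tmp"        # last line has no terminating newline
    ends_with_newline "$tmp" || echo "no trailing newline"
    compare_counts "$tmp"
    rm -f "$tmp"
    ```

    For the sample file this prints "no trailing newline" followed by "wc -l: 1, awk NR: 2". Running ends_with_newline on your own file should tell you whether the same off-by-one applies.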