bash, unix, terminal, pipeline

What's wrong with these two Bash pipeline methods for computing the average file size in a directory?


The problem: I am trying to calculate an average file size for the directory I'm in (ignoring sub-directories) using one-liners. I have two methods:

ls -l | gawk '{sum += $5; n++;} END {print sum/n;}'

and

var1=$(du -Ss| awk '{print $1}') ; var2=$(ls -l | wc -l) ; echo $var1/$var2 | bc

They seem to yield similar numbers, albeit different units (first one in kB, second one in MB).

The numbers themselves however are slightly wrong. What's going on? Which one is more right?


Solution

  • du and ls measure different things. Consider this part of the du man page:

       --apparent-size
              print apparent sizes,  rather  than  disk  usage;  although  the
              apparent  size is usually smaller, it may be larger due to holes
              in ('sparse') files, internal  fragmentation,  indirect  blocks,
              and the like
    

    That gives an idea about the possible differences between what ls shows (apparent size) and what du shows (by default, the actual disk usage).

    $ truncate -s 10737418240 sparse
    $ ls -l sparse
    -rw-rw-r-- 1 ec2-user ec2-user 10737418240 Feb 20 00:19 sparse
    $ du sparse
    0       sparse
    $ ls -ls sparse
    0 -rw-rw-r-- 1 ec2-user ec2-user 10737418240 Feb 20 00:19 sparse
    

    The above shows the difference in reporting for a sparse file.

    Also, counting files with ls -l includes subdirectories, symlinks, and so on, as well as the "total" line that ls -l prints first, so the divisor ends up too large. You can use find instead to list only regular files:

    find . -maxdepth 1 -type f
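
Putting both points together, here is one possible sketch of the average calculation. It assumes GNU find, whose -printf action is not part of POSIX; %s is the apparent size in bytes (what ls -l shows) and %b is the allocated size in 512-byte blocks (closer to what du reports). The function names avg_size and avg_disk_usage are made up for illustration.

```shell
#!/usr/bin/env bash
# Assumption: GNU find (for -printf). Both helper names are hypothetical.

# Average apparent size, in bytes, of the regular files directly inside a
# directory (default: the current one). Subdirectories and symlinks are skipped.
avg_size() {
    find "${1:-.}" -maxdepth 1 -type f -printf '%s\n' |
        awk '{ sum += $1; n++ }
             END { if (n) printf "%.2f\n", sum / n; else print "no files" }'
}

# Average allocated size, in bytes, of the same set of files, derived from
# find's count of 512-byte blocks (%b).
avg_disk_usage() {
    find "${1:-.}" -maxdepth 1 -type f -printf '%b\n' |
        awk '{ sum += $1 * 512; n++ }
             END { if (n) printf "%.2f\n", sum / n; else print "no files" }'
}
```

Running avg_size . and avg_disk_usage . in the same directory should make the gap between the two measures visible: for a directory holding mostly sparse files the second number can be far smaller, while for many tiny files it is usually larger because each file occupies at least one filesystem block.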