
Human-readable filesize and line count


I want a bash command that will return a table, where each row is the human-readable filesize, number of lines, and filename. The table should be sorted by filesize.

I've been trying to do this using a combination of du -hs, wc -l, sort -h, and find.

Here's where I'm at:

find . -exec echo $(du -h {}) $(wc -l {}) \; | sort -h

Solution

  • Your approach fell short not only because the shell expanded your command substitutions ($(...)) up front, but more fundamentally because you cannot pass shell command lines directly to find:

    find's -exec action can only invoke external utilities with literal arguments; the only non-literal argument supported is {}, which represents the filename(s) at hand.
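    This limitation is easy to demonstrate with a quick experiment (the scratch directory and filename below are illustrative, not from the original question):

```shell
# Create a throwaway file to operate on (illustrative path).
mkdir -p /tmp/execdemo && echo 'hi' > /tmp/execdemo/a.txt

# Works: du is an external utility invoked with literal arguments,
# and find substitutes each filename for {}.
find /tmp/execdemo -type f -exec du -h {} \;

# Does NOT work as intended: the calling shell expands $(du -h {})
# once, before find even runs, with {} still a literal string, so
# echo ends up with fixed (here: empty) arguments for every file.
find /tmp/execdemo -type f -exec echo $(du -h {} 2>/dev/null) \;
```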

    choroba's answer fixes your immediate problem by invoking a separate shell instance in each iteration, to which the shell command to execute is passed as a string argument (-exec bash -c '...' \;).
    While this works (assuming you pass the {} value as an argument rather than embedding it in the command-line string), it is also quite inefficient, because multiple child processes are created for each input file.

    (While there is a way to have find pass (typically) all input files to a (typically) single invocation of the specified external utility - namely with terminator + rather than \;, this is not an option here due to the nature of the command line passed.)
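    A sketch of that per-file bash -c variant (the scratch directory and filename are illustrative; note that {} is passed to the inner shell as a positional parameter, not embedded in the command string):

```shell
# Sample input (illustrative path).
mkdir -p /tmp/percall && printf 'a\nb\n' > /tmp/percall/f.txt

# One bash instance (plus du and wc) is spawned per file; $1 receives
# the filename that find substitutes for {}. cut -f1 keeps only du's
# size column; wc -l < file avoids wc echoing the filename.
find /tmp/percall -type f -exec bash -c \
  'printf "%s %s %s\n" "$(du -h "$1" | cut -f1)" "$(wc -l < "$1")" "$1"' _ {} \; |
  sort -h
```

    This yields the requested size / line-count / filename columns, at the cost of several child processes per file.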

    An efficient and robust[1] implementation that minimizes the number of child processes created would look like this:

    Note: I'm assuming GNU utilities here, due to use of head -n -1 and sort -h.
    Also, I'm limiting find's output to files only (as opposed to directories), because wc -l only works on files.

    paste <(find . -type f -exec du -h {} +) <(find . -type f -exec wc -l {} + | head -n -1) |
      awk -F'\t *' 'BEGIN{OFS="\t"} {sub(" .+$", "", $3); print $1,$2,$3}' |
       sort -h -t$'\t' -k1,1
    
    • Note the use of -exec ... + rather than -exec ... \;, which ensures that typically all input filenames are passed to a single invocation of the external utility (if not all filenames fit on a single command line, invocations are batched efficiently so as to make as few calls as possible).

    • wc -l {} + outputs the filename after each line count, and invariably appends a summary (total) line, which head -n -1 strips away.

    • paste combines the lines from each command (whose respective inputs are provided by process substitutions, <(...)) into a single output stream.

    • The awk command then strips the extraneous filename that stems from wc from the end of each line.

    • Finally, the sort command sorts the result by the 1st (-k1,1) tab-separated (-t$'\t') column by human-readable numbers (-h), such as the numbers that du -h outputs (e.g., 1K).
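    Putting it all together, here is a runnable demo on a throwaway directory (paths and file contents are illustrative; GNU head and sort assumed, as noted above):

```shell
# Two sample files of clearly different sizes (illustrative paths).
mkdir -p /tmp/fsdemo
printf 'one line\n' > /tmp/fsdemo/small.txt
seq 1 5000 > /tmp/fsdemo/big.txt   # roughly 24K of digits

# The pipeline from above, applied to the demo directory.
paste <(find /tmp/fsdemo -type f -exec du -h {} +) \
      <(find /tmp/fsdemo -type f -exec wc -l {} + | head -n -1) |
  awk -F'\t *' 'BEGIN{OFS="\t"} {sub(" .+$", "", $3); print $1,$2,$3}' |
  sort -h -t$'\t' -k1,1
```

    Each output row is size, filename, line count; on a typical filesystem with 4K blocks, small.txt sorts before big.txt.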


    [1] As with any line-oriented processing, filenames with embedded newlines are not supported, but I do not consider this a real-world problem.