
Fastest way to count just the first X lines of output


I have a large terminal output from a tshark filter and I want to check whether the number of lines (the number of packets in this example) reaches a threshold of X.

The operation is done in a loop over many big files, so I want to boost performance as much as possible here.

What I think I know is that wc -l is the fastest way to count output from a terminal command.

My line looks like this (the exact tshark command does not matter here, so I simplified it for readability):

THRESHOLD=100
[[ $(tshark -r "$file" -Y "tcp.stream==${streamID}" | wc -l) -gt $THRESHOLD ]] || echo "not enough"

While this works fine, I wonder whether there is a way to stop counting once the threshold is reached. The exact number does not matter as long as I know whether it reaches the threshold.

A guess would be:

HEAD=$((THRESHOLD+1))
[[ $(tshark -r "$file" -Y "tcp.stream==${streamID}" | head -n $HEAD | wc -l) -gt $THRESHOLD ]] || echo "not enough"

But piping through an additional program and incrementing the threshold could be slower, couldn't it?

EDIT: Changing the example code to a working tshark snippet


Solution

  • Benchmark

    Only one way to find out: Benchmark it yourself. Here are some implementations that come to mind.

    gen() { seq "$max"; }
    # functions returning 0 (success) iff `gen` prints less than `$thold` lines
    a() { [ "$(gen | head -n"$thold" | wc -l)" != "$thold" ]; }
    b() { [ -z "$(gen | tail -n+"$thold" | head -c1)" ]; }
    c() { [ "$(gen | grep -cm"$thold" ^)" != "$thold" ]; }
    d() { [ "$(gen | grep -Fcm"$thold" '')" != "$thold" ]; }
    e() { gen | awk "NR >= $thold{exit 1}"; }
    f() { gen | awk -F^ "NR >= $thold{exit 1}"; }
    g() { gen | sed -n "$thold"q1; }
    h() { mapfile -n1 -s"$thold" < <(gen); [ -z "$MAPFILE" ]; }
    
    max=1''000''000''000
    for fn in {a..h}; do
      printf '%s: ' "$fn"
      for ((thold=1''000''000; thold<=max; thold*=10)); do
        printf '%.0e=%2.1fs, ' "$thold" "$({ time -p "$fn"; } 2>&1 | grep -Eom1 '[0-9.]+')"
      done
      echo
    done
    

    In the script above, gen is a placeholder for your actual tshark command. The functions a to h return success (exit status 0) if and only if tshark's output has fewer than $thold lines. You can use them like

    a && echo "tshark printed fewer than $thold lines"
    
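    Plugged into the question's loop, approach b might look like the sketch below. `has_fewer_than` is a hypothetical helper name, and `seq` stands in for tshark so the sketch is runnable; note the +1 needed to reproduce the original `-gt $THRESHOLD` test.

```shell
#!/usr/bin/env bash
# Approach b wrapped as a reusable helper. Succeeds (exit 0) iff the given
# command prints fewer than $1 lines.
has_fewer_than() {
  local thold=$1; shift
  [ -z "$("$@" | tail -n+"$thold" | head -c1)" ]
}

THRESHOLD=100
# The original test was `count > THRESHOLD`, so "not enough" means
# count <= THRESHOLD, i.e. fewer than THRESHOLD+1 lines:
has_fewer_than "$((THRESHOLD + 1))" seq 100 && echo "not enough"  # 100 lines: not enough
has_fewer_than "$((THRESHOLD + 1))" seq 101 || echo "enough"      # 101 lines: enough
```

    With the real command this would read `has_fewer_than "$((THRESHOLD + 1))" tshark -r "$file" -Y "tcp.stream==${streamID}" && echo "not enough"`.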

    Results

    These are the results on my system:

    a: 1e+06=0.0s, 1e+07=0.1s, 1e+08=0.8s, 1e+09=8.9s,
    b: 1e+06=0.0s, 1e+07=0.1s, 1e+08=0.9s, 1e+09=8.4s,
    c: 1e+06=0.0s, 1e+07=0.2s, 1e+08=1.6s, 1e+09=16.1s,
    d: 1e+06=0.0s, 1e+07=0.2s, 1e+08=1.6s, 1e+09=15.7s,
    e: 1e+06=0.1s, 1e+07=0.8s, 1e+08=8.2s, 1e+09=83.2s,
    f: 1e+06=0.1s, 1e+07=0.8s, 1e+08=8.2s, 1e+09=84.6s,
    g: 1e+06=0.0s, 1e+07=0.3s, 1e+08=3.0s, 1e+09=31.6s,
    h: 1e+06=7.7s, 1e+07=90.0s, ... (manually aborted)
    

    b: ... 1e+08=0.9s ... means that approach b took 0.9 seconds to find out that the output of seq 1000000000 had at least 1e+08 (= 100'000'000) lines.
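    A plausible explanation for why b scales well is early exit: tail -n+"$thold" emits its first byte as soon as line $thold is reached, head -c1 exits after reading that byte, and the resulting SIGPIPE terminates the rest of the pipeline, so the generator never produces all of its output. A small runnable illustration (slow_gen is a made-up stand-in for a slow producer):

```shell
# A deliberately slow 20-line generator: reading it fully would take ~2 s.
slow_gen() { for n in $(seq 20); do echo "$n"; sleep 0.1; done; }

thold=5
# head -c1 grabs the first byte of line $thold and exits; SIGPIPE then stops
# tail and slow_gen, so this finishes well under the full 2 s.
first_byte=$(slow_gen | tail -n+"$thold" | head -c1)
[ -n "$first_byte" ] && echo "reached line $thold"
```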

    Conclusion

    From the approaches presented in this answer, b is clearly the fastest. However, the actual results might differ from system to system (there are different implementations and versions of head, grep, ...) and for your actual use case. I recommend benchmarking with your actual data (that is, replace the seq in gen() with your tshark command and set thold to the values you actually use).

    If you need an even faster approach, you can experiment further with stdbuf and LC_ALL=C.
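    For example, approach c could be rerun under the C locale: byte-oriented matching is often cheaper than multibyte-locale matching, but whether it actually helps depends on your grep implementation, so benchmark it. A sketch:

```shell
# Stand-in generator and threshold; replace with your tshark command and value.
gen() { seq 1000; }
thold=500

# Same as approach c, but with LC_ALL=C so grep can match bytes instead of
# multibyte characters. -m makes grep stop reading after $thold matches.
c_c_locale() { [ "$(gen | LC_ALL=C grep -cm"$thold" ^)" != "$thold" ]; }

c_c_locale || echo "gen printed at least $thold lines"
```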