I have a large terminal output from a tshark filter and I want to check whether the number of lines (the number of packets in this example) reaches a threshold of X.
The operation runs in a loop over many big files, so I want to squeeze out as much performance as possible here. As far as I know, wc -l is the fastest way to count the output of a terminal command.
My line looks like this (the specific tshark command does not matter here):
THRESHOLD=100
[[ $(tshark -r "$file" -Y "tcp.stream==${streamID}" | wc -l) -gt $THRESHOLD ]] || echo "not enough"
While this works fine, I wonder if there is a way to stop counting once the threshold is reached. The exact number does not matter, as long as I know whether it reaches the threshold or not.
A guess would be:
HEAD=$((THRESHOLD+1))
[[ $(tshark -r "$file" -Y "tcp.stream==${streamID}" | head -n "$HEAD" | wc -l) -gt $THRESHOLD ]] || echo "not enough"
But piping through an additional process and incrementing the threshold could be slower, couldn't it?
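A quick way to check my guess would be to time both variants against a generated file (the file here is just a hypothetical stand-in for the tshark output):
seq 5000000 > /tmp/lines.txt   # hypothetical stand-in for the tshark output
time bash -c '[[ $(cat /tmp/lines.txt | wc -l) -gt 100 ]]'
time bash -c '[[ $(cat /tmp/lines.txt | head -n 101 | wc -l) -gt 100 ]]'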
EDIT: Changing the example code to a working tshark snippet
Only one way to find out: Benchmark it yourself. Here are some implementations that come to mind.
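# gen simulates the producing command; replace seq with your actual tshark call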
gen() { seq "$max"; }
# functions returning 0 (success) iff `gen` prints fewer than `$thold` lines
a() { [ "$(gen | head -n"$thold" | wc -l)" != "$thold" ]; }
b() { [ -z "$(gen | tail -n+"$thold" | head -c1)" ]; }
c() { [ "$(gen | grep -cm"$thold" ^)" != "$thold" ]; }
d() { [ "$(gen | grep -Fcm"$thold" '')" != "$thold" ]; }
e() { gen | awk "NR >= $thold{exit 1}"; }
f() { gen | awk -F^ "NR >= $thold{exit 1}"; }
g() { gen | sed -n "$thold"q1; }
h() { mapfile -n1 -s"$((thold-1))" < <(gen); [ -z "$MAPFILE" ]; }
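# benchmark: time each function for thresholds growing from 10^6 to 10^9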
max=1000000000
for fn in {a..h}; do
    printf '%s: ' "$fn"
    for ((thold=1000000; thold<=max; thold*=10)); do
        printf '%.0e=%2.1fs, ' "$thold" "$({ time -p "$fn"; } 2>&1 | grep -Eom1 '[0-9.]+')"
    done
    echo
done
In the script above, gen is a placeholder for your actual command, that is, your tshark call producing the output lines. The functions a to h succeed (return exit status 0) if and only if gen's output has fewer than $thold lines. You can use them like
a && echo "tshark printed fewer than $thold lines"
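Adapted to the question's setting, a minimal sketch (assuming approach b, and that file and streamID are already set; note thold is the question's THRESHOLD plus one, since b succeeds on fewer than thold lines) could look like:
thold=101                    # THRESHOLD + 1: b succeeds iff fewer than thold lines
gen() { tshark -r "$file" -Y "tcp.stream==${streamID}"; }
b() { [ -z "$(gen | tail -n+"$thold" | head -c1)" ]; }
b && echo "not enough"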
These are the results on my system:
a: 1e+06=0.0s, 1e+07=0.1s, 1e+08=0.8s, 1e+09=8.9s,
b: 1e+06=0.0s, 1e+07=0.1s, 1e+08=0.9s, 1e+09=8.4s,
c: 1e+06=0.0s, 1e+07=0.2s, 1e+08=1.6s, 1e+09=16.1s,
d: 1e+06=0.0s, 1e+07=0.2s, 1e+08=1.6s, 1e+09=15.7s,
e: 1e+06=0.1s, 1e+07=0.8s, 1e+08=8.2s, 1e+09=83.2s,
f: 1e+06=0.1s, 1e+07=0.8s, 1e+08=8.2s, 1e+09=84.6s,
g: 1e+06=0.0s, 1e+07=0.3s, 1e+08=3.0s, 1e+09=31.6s,
h: 1e+06=7.7s, 1e+07=90.0s, ... (manually aborted)
b: ... 1e+08=0.9s ... means that approach b took 0.9 seconds to find out that the output of seq 1000000000 had at least 1e+08 (= 100'000'000) lines.
Of the approaches presented in this answer, b is clearly the fastest. However, the actual results might differ from system to system (there are different implementations and versions of head, grep, ...) and for your actual use case. I recommend benchmarking with your actual data, that is, replace the seq in gen() with your tshark command and set thold to the values you actually use.
If you need an even faster approach, you can experiment more with stdbuf and LC_ALL=C.
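For example, an untested sketch for approach c (the name c_fast is made up here): forcing the C locale lets grep skip locale-aware text processing, which is often faster. stdbuf, on the other hand, would have to wrap the external commands themselves (e.g. your tshark call), since it cannot be applied to shell functions.
# untested sketch: same as approach c, but with grep running in the C locale
c_fast() { [ "$(gen | LC_ALL=C grep -cm"$thold" ^)" != "$thold" ]; }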