I've been using a shell script for ~3 years that starts a tail -F -n0 /path/to/LOGFILE, parses and formats the output, and dumps it into a pipe-delimited file. However, we've gone from several thousand log lines a day to several million log lines a day, and the script has started to eat huge amounts of memory and CPU.
I recently went and replaced all of the data-parsing logic with awk, which in testing seems to be roughly two orders of magnitude faster at parsing the data. To test the new awk code I pumped an entire day's worth of logs (~6 million lines) through both the awk code and the shell code; unsurprisingly, the awk code pulled out the ~92K relevant lines in ~10 seconds while the shell code took over 15 minutes to do the exact same thing. However, if I take the exact same awk code and replace cat /path/to/file|awk '...' with tail -F -n0 /path/to/file|awk '...', there is a HUGE delay before the text shows up in the output file: up to ~2-3 minutes, as opposed to ~0.5-1.0 seconds when it goes through the shell code.
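For reference, the throughput comparison can be reproduced with something along these lines, with cat over a saved day of logs substituted for the tail -F inside each script (the script names here are placeholders, not the real file names):

time ./awk_version.sh     # ~10 seconds for the ~6M-line day, ~92K lines extracted
time ./shell_version.sh   # 15+ minutes for the same input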
Shell code (yes I know how ugly shell code is):
outDir="/opt/redacted/logs/allerrors/"
tail -F -n0 /opt/redacted/logs/manager.log|while read -a logLine;do
if [[ "${logLine[2]}" == "E" ]];then
if [[ "${logLine[7]:0:1}" == "@" ]];then
echo "${logLine[0]}|${logLine[1]}|${logLine[6]}|${logLine[7]}|${logLine[@]:8:${#logLine[@]}}" >> ${outDir}allerrors.${logLine[0]//\//.}.log
else
echo "${logLine[0]}|${logLine[1]}|${logLine[6]}|NULL|${logLine[@]:7:${#logLine[@]}}" >> ${outDir}allerrors.${logLine[0]//\//.}.log
fi
fi
done
awk code:
outDir="/opt/redacted/logs/allerrors/"
tail -F -n0 /opt/redacted/logs/manager.log|awk -v dir="$outDir" '
BEGIN { OFS = "|" }
$3 == "E" {
    # One output file per value of field 1, with any "/" replaced by "."
    file = "allerrors." $1 ".log"
    gsub("/", ".", file)
    if ($8 ~ /@/)
        print $1, $2, $7, $8, substr($0, index($0, $9)) >> (dir file)
    else
        print $1, $2, $7, "NULL", substr($0, index($0, $8)) >> (dir file)
}'
To be clear, both sets of code work and create identical output if I use cat instead of tailing the file, but with the awk code I don't see results in my output file until ~2-3 minutes after a line appears in the log, while the shell version only takes a few seconds.
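(Checking the "identical output" claim can be done by pointing each version's outDir at its own scratch directory for a run over the same saved log and then diffing the two directories; the paths below are placeholders.)

diff -r /tmp/allerrors.shell/ /tmp/allerrors.awk/   # silent diff means the two versions produced the same files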
awk buffers its output by default when it isn't writing to a terminal, while the shell loop's echo effectively writes each line out as soon as it is read. The buffering increases awk's throughput but also its latency. Just add a call to fflush() to your awk code to force the buffered output to be flushed after each matching line:
outDir="/opt/redacted/logs/allerrors/"
tail -F -n0 /opt/redacted/logs/manager.log|awk -v dir="$outDir" '
BEGIN { OFS = "|" }
$3 == "E" {
    file = "allerrors." $1 ".log"
    gsub("/", ".", file)
    if ($8 ~ /@/)
        print $1, $2, $7, $8, substr($0, index($0, $9)) >> (dir file)
    else
        print $1, $2, $7, "NULL", substr($0, index($0, $8)) >> (dir file)
    fflush()    # push the buffered line out to the file immediately
}'
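One caveat, depending on your awk: current GNU awk flushes every open output file and pipe on a bare fflush(), but some implementations (and older gawk versions) only flush standard output when no argument is given. Since this script writes to files via >>, the more targeted form flushes exactly the file the record was just appended to:

    fflush(dir file)    # flush only the output file this record went to

Either way the buffering, and therefore the 2-3 minute delay, goes away at the cost of one flush per matching line.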