Search code examples
bashshelloptimizationawklogparser

Why is printing data to a file considerably slower with awk than with the shell?


I've been using a shell script for ~3 years that starts a tail -F -n0 /path/to/LOGFILE, parses and formats the output and dumps it into a pipe delimited file. However, we've gone from several thousand log lines a day to several million log lines a day and the script has started to eat huge amounts of memory and CPU.

I recently gone and replaced all of the data parsing logic with awk, which seems in testing to be many MANY orders of magnitude faster at parsing data. To test the new awk code I tried pumping an entire days worth of logs through the logic (~6 million lines) using both the awk code and the shell code and unsurprisingly the awk code pulled the ~92K relevant lines in ~10 seconds while the shell code took over 15 minutes to do the exact same thing. However, if I use the exact same code, but instead of cat /path/to/file|awk '...' I do tail -F -n0 /path/to/file|awk '... there is a HUGE delay as to when the text is written to the file, up to ~2-3 minutes as opposed to ~0.5-1.0 seconds when it's going through the shell code.

Shell code (yes I know how ugly shell code is):

outDir="/opt/redacted/logs/allerrors/"
tail -F -n0 /opt/redacted/logs/manager.log|while read -a logLine;do
if [[ "${logLine[2]}" == "E" ]];then
  if [[ "${logLine[7]:0:1}" == "@" ]];then
    echo "${logLine[0]}|${logLine[1]}|${logLine[6]}|${logLine[7]}|${logLine[@]:8:${#logLine[@]}}" >> ${outDir}allerrors.${logLine[0]//\//.}.log
  else
    echo "${logLine[0]}|${logLine[1]}|${logLine[6]}|NULL|${logLine[@]:7:${#logLine[@]}}" >> ${outDir}allerrors.${logLine[0]//\//.}.log
  fi
fi

done

awk code:

  outDir="/opt/redacted/logs/allerrors/"
  tail -F -n0 /opt/redacted/logs/manager.log|awk -v dir=$outDir '{OFS="|"}
{
  if ($3 == "E")
  {
    file="allerrors."$1".log"
    gsub("/",".",file)
    if ($8 ~ /@/)
      print $1,$2,$7,$8,substr($0, index($0,$9)) >> dir file
    else {if ($8 !~ /@/)
      print $1,$2,$7,"NULL",substr($0, index($0,$8)) >> dir file
    }
  }
}'

To be clear, both sets of code work and create identical output if I use cat instead of tailing the file, but with the awk code I don't see results in my output file until ~2-3 minutes after it appears in the log while the shell version only takes a few seconds.


Solution

  • awk buffers by default while sh does not. This increases awk's throughput but also its latency.

    Just add fflush(); to your awk code to force the buffers to flush:

      outDir="/opt/redacted/logs/allerrors/"
      tail -F -n0 /opt/redacted/logs/manager.log|awk -v dir=$outDir '{OFS="|"}
    {
      if ($3 == "E")
      {
        file="allerrors."$1".log"
        gsub("/",".",file)
        if ($8 ~ /@/)
          print $1,$2,$7,$8,substr($0, index($0,$9)) >> dir file
        else {if ($8 !~ /@/)
          print $1,$2,$7,"NULL",substr($0, index($0,$8)) >> dir file
        }
        fflush();
      }
    }'