Tags: bash, unix, awk, unix-timestamp

How to use awk to read data between regular time intervals


My log file has the following format:

[30/Jan/2015:10:10:30 +0000] 12.30.30.204 xff=- reqId=[-] status_check len=- GET /api/getstatus HTTP/1.1 mi=- ec=- 200 425
[30/Jan/2015:10:11:00 +0000] 12.30.30.204 xff=- reqId=[-] status_check len=- GET /api/getstatus HTTP/1.1 mi=- ec=- 200 261
[30/Jan/2015:10:11:29 +0000] 12.30.30.204 xff=- reqId=[-] status_check len=- GET /api/getstatus HTTP/1.1 mi=- ec=- 200 232
[30/Jan/2015:10:12:00 +0000] 12.30.30.204 xff=- reqId=[-] status_check len=- GET /api/getstatus HTTP/1.1 mi=- ec=- 200 315
[30/Jan/2015:10:12:29 +0000] 12.30.30.204 xff=- reqId=[-] status_check len=- GET /api/getstatus HTTP/1.1 mi=- ec=- 200 221
[30/Jan/2015:10:12:57 +0000] 12.30.30.182 xff=- reqId=[-] status_check len=- GET /api/getstatus HTTP/1.1 mi=- ec=- 200 218

Each line in this log file has a timestamp in the first field and a response time in the last field. Is there a way in awk to compute the average response time over specific intervals? For example, calculating the average response time for every five minutes, based on the timestamps in the log file.

Or is there a better alternative to awk for doing this? Please suggest.

Update:

I have tried the following, which is a static way of doing it and gives only the average of one time interval.

$ grep "30/Jan/2015:10:1[0-4]" mylog.log | awk '{resp+=$NF;cnt++;}END{print "Avg:"int(resp/cnt)}'

But I need to do this for the whole file, for every 5-minute interval. Even if I loop over the command, how can I pass the date to it dynamically? The log file varies every time, and so do the dates in it.


Solution

  • Hm. GNU date does not like your date format, so I guess we'll have to parse it ourselves. I'm thinking along these lines (this requires gawk for mktime):

    # returns the seconds since epoch that stamp represents. This will be
    # the first field in the line, with [] and everything. It's rather
    # rudimentary:
    function parse_timestamp(stamp) {
      # Split stamp into tokens delimited by [, ], /, : or space
      split(stamp, c, "[][/: ]")
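      # e.g. for "[30/Jan/2015:10:10:30" this yields c[2]="30", c[3]="Jan",
      # c[4]="2015", c[5]="10" (hour), c[6]="10" (minute), c[7]="30" (second)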
    
      # reassemble (using the lookup table for the months from below) in a
      # format that mktime understands (then call mktime).
      return mktime(c[4] " " mnums[c[3]] " " c[2] " " c[5] " " c[6] " " c[7])
    }
    
    BEGIN {
      # parse_timestamp needs this lookup table.
      split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", mnames)
      for(i = 1; i <= length(mnames); ++i) {
        mnums[mnames[i]] = i
      }
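      # now mnums["Jan"] == 1, mnums["Feb"] == 2, ..., mnums["Dec"] == 12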
    
      # time is a parameter supplied by you.
      start = parse_timestamp(time)
      end   = start + 300
    
      if(start == -1) {
        print "Warning: Could not parse timestamp \"" time "\""
      }
    }
    
    { 
      # in each line: parse the timestamp
      curtime = parse_timestamp($1)
    }
    
    # if it lies in the interval you want, sum up the last field and increase
    # the counter
    curtime >= start && curtime < end {
      sum += $NF
      ++count
    }
    
    END {
      # and in the end, print the average.
      print "Avg: " (count == 0 ? "undef" : sum / count)
    }
    

    Put this in a file, say average.awk, and call

    awk -v time='[30/Jan/2015:10:11:20 +0000]' -f average.awk foo.log
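
    With the sample log above this prints Avg: 246.5: the four entries from
    10:11:29 to 10:12:57 (232, 315, 221 and 218) fall into the five-minute
    window starting at 10:11:20, and 986 / 4 = 246.5.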
    

    If you are sure the log file will be sorted in ascending order (which is probably the case), you could make this more efficient by replacing

    curtime >= start && curtime < end {
      sum += $NF
      ++count
    }
    

    with

    curtime >= end {
      exit
    }
    
    curtime >= start {
      sum += $NF
      ++count
    }
    

    This stops scanning for matching log entries as soon as it finds the first one past the range you were looking for.

    Addendum: Since the OP clarified that they wanted summaries for all five-minute intervals in a sorted logfile, a tweaked script to do that is:

    #!/usr/bin/gawk -f
    
    function parse_timestamp(stamp) {
      split(stamp, c, "[][/: ]")
      return mktime(c[4] " " mnums[c[3]] " " c[2] " " c[5] " " c[6] " " c[7])
    }
    
    BEGIN {
      split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", mnames)
      for(i = 1; i <= length(mnames); ++i) {
        mnums[mnames[i]] = i
      }
    }
    
    { 
      curtime = parse_timestamp($1)
    }
    
    NR == 1 {
      # pull the start time from the first line
      start = curtime
      end   = start + 300
    }
    
    curtime >= end {
      # print result and reset counters when the end of the interval is
      # passed. Loop so that empty intervals in a gap of the log are
      # reported too and the interval boundaries do not drift.
      while(curtime >= end) {
        print "Avg: " (count == 0 ? "undef" : sum / count)
        sum   = 0
        count = 0
        end  += 300
      }
    }
    
    {
      sum += $NF
      ++count
    }
    
    END {
      # print once more at the very end for the last, unfinished interval.
      print "Avg: " (count == 0 ? "undef" : sum / count)
    }
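
    Save this as, say, average5.awk (the name is just an example), make it
    executable, and run it on the log:

    chmod +x average5.awk
    ./average5.awk mylog.log

    With the six sample lines above, every entry falls into the first
    five-minute window (it starts at 10:10:30), so this prints a single
    line, Avg: 278.667.

    If wall-clock-aligned buckets (10:10, 10:15, ...) are acceptable
    instead of windows anchored to the first log entry, a shorter sketch
    that needs neither gawk nor mktime groups lines on a truncated
    timestamp key (the key format here is just an illustration):

    awk '{
      split($1, t, "[][/:]")
      key = t[2] "/" t[3] "/" t[4] " " t[5] ":" sprintf("%02d", int(t[6] / 5) * 5)
      sum[key] += $NF; cnt[key]++
    }
    END {
      for(k in sum) print k, "Avg: " sum[k] / cnt[k]
    }' mylog.log | sort

    Note that for(k in sum) yields the buckets in no particular order,
    hence the sort (which orders correctly within a single day).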