Search code examples
awkjqzeek

Trying to reproduce awk in jq


Preamble: Enterprise Network Engineer/Architect (Non-Programmer).
Synopsis: Logs moving from txt output to json
Issue: Not successfully trying to migrate from a functioning awk data extraction to using jq against json using one-liner.
Active awk:

awk '
   BEGIN{ FS="\t" }
  { arr[$1 FS $2] += $3; count[$1 FS $2] += 1 }
  END{ for (key in arr) printf "%s%s%s%s%s\n", key, FS, count[key], FS, arr[key] }
' | sort -nrk 4 | head -1 | awk '{ print $1" | "$2" | "$4/60/60 }'

End Result: using jq to count duplicate entries for src/dst ip address and dst port and ADD the cumulative duration of the connections.

SAMPLE JSON INPUT

{
  "ts": 1636xxxxx.41xxx34,
  "uid": "hex_code",
  "id.orig_h": "10.x.x.11",
  "id.orig_p": 42996,
  "id.resp_h": "10.x.x.123",
  "id.resp_p": 53,
  "proto": "udp",
  "service": "dns",
  "duration": 0.01117664844,
  "conn_state": "SF",
  "local_orig": true,
  "local_resp": true,
  "missed_bytes": 0,
  "history": "Dd",
  "orig_pkts": 1,
  "orig_ip_bytes": 71,
  "resp_pkts": 1,
  "resp_ip_bytes": 71
}
{
  "ts": 1xxxx0501.5xxx47,
  "uid": "hex_code",
  "id.orig_h": "10.x.x.11",
  "id.orig_p": 36299,
  "id.resp_h": "10.x.x.123",
  "id.resp_p": 53,
  "proto": "udp",
  "service": "dns",
  "duration": 0.00857415966797,
  "conn_state": "SF",
  "local_orig": true,
  "local_resp": true,
  "missed_bytes": 0,
  "history": "Dd",
  "orig_pkts": 1,
  "orig_ip_bytes": 74,
  "resp_pkts": 1,
  "resp_ip_bytes": 74
}
targeted jq output...
10.xxx.xxx.21 | 18.xx1.xx1.158 | 45.6606 <--time is shown cumulative duration

Solution

  • As far as I understand what you are trying to accomplish, this might go in your direction:

    jq -sr '
      group_by([."id.orig_h", ."id.resp_h"])[]
      | [(first|."id.orig_h", ."id.resp_h"), (map(.duration)|add)]
      | @csv
    ' extract.json
    

    Explanation: Your input is a stream of objects. Reading them in using -s (or --slurp) turns the stream into an array. With group_by we transform it into an array of arrays, separating equal values, here given as an array of the two IP fields. Next, for each element of the outer array (the group members) we construct an array of the two IP fields of the first member only (this suffices as all others are equal to this regard), and as third value we add up the values of each group members .duration field. Lastly, the constructed array is transformed to a line of CSV using @csv, which is printed raw thanks to the initial -r (--raw-output) parameter.

    Note: I treated field values like "ts": 1636xxxxx.41xxx34 as obfuscation of an actual number. If, however, it's rather a string containing some x characters, then the string literal needs to be quoted ("ts": "1636xxxxx.41xxx34") for being proper JSON.


    Answering a follow-up question, how to filter out durations equal to 0 and sort the remaining lines by duration from highest to lowest

    jq -sr '
      group_by([."id.orig_h", ."id.resp_h"]) | map(
        [(first|."id.orig_h", ."id.resp_h"), (map(.duration)|add)]
        | select(.[2] > 0)
      )
      | sort_by(-.[2])[]
      | @csv
    ' extract.json