Tags: json, batch-file, jq, data-partitioning

Get item and subsequent item based on a property of the first one


I have an event-log file generated by a third-party tool that I cannot change. The log file is a huge JSON array in which the odd-numbered elements contain metadata and the even-numbered elements contain the message body associated with that metadata. I want to be able to split the file based on the metadata, grouping the information by subject into different files.

I am working on this project on Windows and I am trying to do it using a batch file and jq.

Basically the array looks like this:

[
  { "type": "abc123" },
  { "name": "first component of type abc123" },
  { "type": "abc123" },
  { "name": "second component of type abc123" },
  { "type": "def124" },
  { "name": "first component of type def124" },
  { "type": "xyz999" },
  { "name": "first component of type xyz999" },
  { "type": "abc123" },
  { "name": "third component of type abc123" },
  { "type": "def124" },
  { "name": "second component of type def124" },
  { "type": "abc123" },
  { "name": "fifth component of type abc123" },
  { "type": "abc123" },
  { "name": "sixth component of type abc123" },
  { "type": "def124" },
  { "name": "third component of type def124" },
  { "type": "def124" },
  { "name": "fourth component of type def124" },
  { "type": "abc123" },
  { "name": "seventh component of type abc123" },
  { "type": "xyz999" },
  { "name": "second component of type xyz999" }
  ...
]

I know that I only have 3 types, so what I am trying to achieve is to create a file for each of them, something like:

First file

{
  "componentLog": {
       "type": "abc123",
       "information": [
          "first component of type abc123",
          "second component of type abc123",
          "third component of type abc123",
          ...
       ]
     }
}

Second file

{
  "componentLog": {
       "type": "def124",
       "information": [
          "first component of type def124",
          "second component of type def124",
          "third component of type def124",
          ...
       ]
     }
}

Third file

{
  "componentLog": {
       "type": "xyz999",
       "information": [
          "first component of type xyz999",
          "second component of type xyz999",
          "third component of type xyz999",
          ...
       ]
     }
}

I know that I can separate the metadata with this

jq.exe ".[] | select(.type==\"abc123\")" file.json

And then I try to match the indices. But index just returns the index of the first item that matches the select statement... So I don't know how to solve this...
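
For example, against the sample above the select command only prints the matching metadata objects, with no link back to the name element that follows each one:

{
  "type": "abc123"
}
{
  "type": "abc123"
}
...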


Solution

  • The following bash script is a bit messy because it assumes none of the files (input or output) will fit into memory.

    If you don't already have access to bash, sed and awk in your computing environment, you might want to consider installing a Unix-like toolset for Windows (e.g. Cygwin, MSYS2, or WSL), or you could adapt the script as appropriate, e.g. using gawk for Windows, or Ruby for Windows.

    The other main assumption not already embedded in the original question is that it's OK to remove the log-type*.tmp files and overwrite log-TYPE.json for the various values of "type".

    Be sure to set input to the appropriate input file name.

    # The input file name:
    input=file.json
    
    # Remove any temporary files left over from a previous run
    /bin/rm -f log-type.*.tmp
    
    # Use jq to produce a stream of .type and .name values 
    # as per the jq FAQ
    jq -cn --stream '
       fromstream(1|truncate_stream(inputs))
       | if .type then .type else .name end'  "$input" |
     awk '
          # Odd-numbered lines carry the type: strip the surrounding quotes and remember it
          NR%2 {fn=$1; sub(/^"/,"",fn); sub(/"$/,"",fn); next}
          # Even-numbered lines carry the (still JSON-quoted) name: append it to the per-type file
          { print > ("log-type." fn ".tmp") }
    '
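    # For the sample input, the jq step emits alternating type / name strings, e.g.
    #   "abc123"
    #   "first component of type abc123"
    #   "abc123"
    #   "second component of type abc123"
    #   ...
    # so after the awk step each log-type.TYPE.tmp file holds one JSON-quoted
    # name per line for that type.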
    
    for f in log-type.*.tmp ; do
        echo "formatting $f ..."
        # Recover the type name from the temporary file name
        g=$(sed -e 's/^log-type\.//' -e 's/\.tmp$//' <<< "$f")
        echo "g=$g"
        # Wrap the quoted names in the desired componentLog envelope
        awk -v type="\"$g\"" '
          BEGIN { print "{\"componentLog\": { \"type\": " type ",";
                  print "\"information\": ["; }
          NR==1 { print; next }      # first name: no leading comma
          { print ",", $0 }          # subsequent names: prefix with a comma
          END   { print "]}}" }' "$f" > "log-$g.json"
    done
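
  • If the whole log file does fit in memory, a simpler pure-jq variant is
    possible. The following is only a sketch, assuming the three type values
    from the sample (abc123, def124, xyz999) and the input file name file.json;
    it reads the file once per type and pairs each metadata element with the
    element that immediately follows it:

    for t in abc123 def124 xyz999 ; do
        jq --arg type "$t" '
          { componentLog: {
              type: $type,
              information: [ range(0; length; 2) as $i
                             | select(.[$i].type == $type)
                             | .[$i + 1].name ] } }
        ' file.json > "log-$t.json"
    done

    This produces the same componentLog layout as the script above, just
    without the temporary files.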