Tags: json, bash, jq, data-partitioning

jq: How can I pipe objects from array to different files based on data in object?


I have a large array of objects stored in a master JSON file. I want to loop through that array and append each object to a file chosen by a field in the object (in this case, the state name). In other words, given a data set covering many states, I want to split it into one file per state.

I already have a jq expression that keeps only the fields I actually need:

{ fipscode: .fipscode, level: .level, polid: .polid, polnum: .polnum, precinctsreporting: .precinctsreporting, precinctsreportingpct: .precinctsreportingpct, precinctstotal: .precinctstotal, raceid: .raceid, runoff: .runoff, statepostal: .statepostal, votecount: .votecount, votepct: .votepct, winner: .winner }

Here's a sample of my input:

[
    { "ballotorder": 2, "candidateid": "9718", "delegatecount": 0, "description": null, "electiondate": "2018-08-28", "electtotal": 0, "electwon": 0, "fipscode": null, "first": "Doug", "id": "3015-polid-64364-state-AZ-1", "incumbent": true, "initialization_data": false, "is_ballot_measure": false, "last": "Ducey", "lastupdated": "2018-08-30T00:01:38.897Z", "level": "state", "national": true, "officeid": "G", "officename": "Governor", "party": "GOP", "polid": "64364", "polnum": "5554", "precinctsreporting": 1488, "precinctsreportingpct": 0.9993000000000001, "precinctstotal": 1489, "raceid": "3015", "racetype": "Primary", "racetypeid": "R", "reportingunitid": "state-AZ-1", "reportingunitname": null, "runoff": false, "seatname": null, "seatnum": null, "statename": "Arizona", "statepostal": "AZ", "test": false, "uncontested": false, "votecount": 355455, "votepct": 0.705493, "winner": true },
    { "ballotorder": 2, "candidateid": "21689", "delegatecount": 0, "description": null, "electiondate": "2018-08-28", "electtotal": 0, "electwon": 0, "fipscode": null, "first": "Ron", "id": "10046-polid-62557-state-FL-1", "incumbent": false, "initialization_data": false, "is_ballot_measure": false, "last": "DeSantis", "lastupdated": "2018-08-29T19:29:50.367Z", "level": "state", "national": true, "officeid": "G", "officename": "Governor", "party": "GOP", "polid": "62557", "polnum": "13918", "precinctsreporting": 5968, "precinctsreportingpct": 1.0, "precinctstotal": 5968, "raceid": "10046", "racetype": "Primary", "racetypeid": "R", "reportingunitid": "state-FL-1", "reportingunitname": null, "runoff": false, "seatname": null, "seatnum": null, "statename": "Florida", "statepostal": "FL", "test": false, "uncontested": false, "votecount": 913997, "votepct": 0.564728, "winner": true },
    { "ballotorder": 2, "candidateid": "45555", "delegatecount": 0, "description": null, "electiondate": "2018-08-28", "electtotal": 0, "electwon": 0, "fipscode": null, "first": "Rex", "id": "38538-polid-67011-state-OK-1", "incumbent": false, "initialization_data": false, "is_ballot_measure": false, "last": "Lawhorn", "lastupdated": "2018-08-29T02:44:44.610Z", "level": "state", "national": true, "officeid": "G", "officename": "Governor", "party": "Lib", "polid": "67011", "polnum": "40784", "precinctsreporting": 1951, "precinctsreportingpct": 1.0, "precinctstotal": 1951, "raceid": "38538", "racetype": "Runoff", "racetypeid": "L", "reportingunitid": "state-OK-1", "reportingunitname": null, "runoff": false, "seatname": null, "seatnum": null, "statename": "Oklahoma", "statepostal": "OK", "test": false, "uncontested": false, "votecount": 379, "votepct": 0.409287, "winner": false }
]

As output, I would expect an Arizona.json containing only the item(s) from that state, with the unwanted fields removed:

[
  { "fipscode": null, "level": "state", "polid": "64364", "polnum": "5554", "precinctsreporting": 1488, "precinctsreportingpct": 0.9993000000000001, "precinctstotal": 1489, "raceid": "3015", "runoff": false, "statepostal": "AZ", "votecount": 355455, "votepct": 0.705493, "winner": true }
]

...and likewise for the other states involved (Florida.json and Oklahoma.json).


Here's the bash and jq script I have so far:

cat master.json |
jq -cn --stream 'fromstream(1|truncate_stream(inputs))' |
jq -c '.statename as $state | {
    fipscode: .fipscode,
    level: .level,
    polid: .polid,
    polnum: .polnum,
    precinctsreporting: .precinctsreporting,
    precinctsreportingpct: .precinctsreportingpct,
    precinctstotal: .precinctstotal,
    raceid: .raceid,
    runoff: .runoff,
    statepostal: .statepostal,
    votecount: .votecount,
    votepct: .votepct,
    winner: .winner
}'

What I can't figure out is how to intercept each row so I can determine where the output should go. Is this possible?


Solution

  • You can do this with one jq process that splits the input file into individual data items, plus one more jq instance per state that collects that state's items into an array, with bash providing the glue. The following example targets bash 4.2 or newer (it might work with 4.1; I'd need to check).

    #!/usr/bin/env bash
    case $BASH_VERSION in ''|[123].*|4.[01].*) echo "ERROR: Bash 4.2 required" >&2; exit 1;; esac
    
    input_file=$1
    [[ -s $input_file ]] || { echo "Usage: ${0##*/} input-file" >&2; exit 1; }
    
    jq_split_script='
    # modify this function to fit your needs
    def relevantContentOnly:
      { fipscode, level, polid, polnum, precinctsreporting, precinctsreportingpct, precinctstotal, raceid, runoff, statepostal, votecount, votepct, winner };
    
    .[] | [.statename, (relevantContentOnly | tojson)] | @tsv
    '
    
    # Use an associative array to map from state names to output FDs
    declare -A out_fds=( )
    
    # Read state / line-of-data pairs from our JQ script...
    while IFS=$'\t' read -r state data; do
      # If we don't already have a writer for the current state, start one.
      if [[ ! ${out_fds[$state]} ]]; then
        exec {new_fd}> >(jq -n '[inputs]' >"$state.json")
        out_fds[$state]=$new_fd
      fi
      # Regardless, send the data to the FD we have for this state
      printf '%s\n' "$data" >&${out_fds[$state]}
    done < <(jq -rc "$jq_split_script" <"$input_file") # ...running the JQ script above.
    
    # Close output FDs, so the JQ instances all see EOF and flush.
    # Iterate over the array's values (the FD numbers), not its keys (the state names).
    for fd in "${out_fds[@]}"; do
      exec {fd}>&-
    done
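
The file-descriptor bookkeeping is probably the least familiar part of the script above, so here is a minimal sketch of that mechanism on its own, with jq taken out. It is an illustration under assumptions, not a replacement: the `workdir` scratch directory, the `.txt` file names, and the sample keys and payloads are all made up for the demo, and each FD points straight at a file rather than at a `jq -n '[inputs]'` process substitution.

```shell
#!/usr/bin/env bash
# Sketch: route "key<TAB>payload" lines to one output file per key,
# keeping one open file descriptor per key in an associative array.
# Needs bash 4.1+ for the {var}> automatic FD allocation syntax.

workdir=$(mktemp -d)   # hypothetical scratch directory for this demo
cd "$workdir" || exit 1

declare -A out_fds=( )

while IFS=$'\t' read -r key payload; do
  # First time we see this key: open its file and remember the FD number.
  if [[ -z ${out_fds[$key]} ]]; then
    exec {new_fd}>"$key.txt"
    out_fds[$key]=$new_fd
  fi
  # Every line goes to the FD already associated with its key.
  printf '%s\n' "$payload" >&"${out_fds[$key]}"
done < <(printf '%s\t%s\n' Arizona first Florida second Arizona third)

# Close every FD. "${out_fds[@]}" expands to the FD numbers;
# "${!out_fds[@]}" would expand to the keys instead.
for fd in "${out_fds[@]}"; do
  exec {fd}>&-
done

cat Arizona.txt Florida.txt
```

After this runs, Arizona.txt holds the lines `first` and `third`, and Florida.txt holds `second`. Each file is opened exactly once no matter how many rows share its key, which is the same property that lets the full script run only one collating jq per state.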