Search code examples
arraysjsonobjectjqsubtraction

subtracting one json file from another in jq


Is there a way to compare two json files in jq? Specifically, I'd like to be able to remove objects from one json file if they occur in another json file. Basically, subtract one file from another. It would be a bonus if I could generalize this so that I could define the equality criteria for the objects, but this is not strictly necessary, it can be based strictly on the objects being identical.

So the more general case would look like this. Let's say I have a file that looks like this:

[
  {
    "name": "Cynthia",
    "surname": "Craig",
    "isActive": true,
    "balance": "$2,426.88"
  },
  {
    "name": "Elise",
    "surname": "Long",
    "isActive": false,
    "balance": "$1,892.72"
  },
  {
    "name": "Hyde",
    "surname": "Adkins",
    "isActive": true,
    "balance": "$1,769.34"
  },
  {
    "name": "Matthews",
    "surname": "Jefferson",
    "isActive": true,
    "balance": "$1,991.42"
  },
  {
    "name": "Kris",
    "surname": "Norris",
    "isActive": false,
    "balance": "$2,137.11"
  }
]

And I have a second file that looks like this:

[
  {
    "name": "Cynthia",
    "surname": "Craig"
  },
  {
    "name": "Kris",
    "surname": "Norris"
  }
] 

I'd like to remove any objects from the first file where the name and surname fields match an object of the second file, so that the results should look like this:

[
  {
    "name": "Elise",
    "surname": "Long",
    "isActive": false,
    "balance": "$1,892.72"
  },
  {
    "name": "Hyde",
    "surname": "Adkins",
    "isActive": true,
    "balance": "$1,769.34"
  },
  {
    "name": "Matthews",
    "surname": "Jefferson",
    "isActive": true,
    "balance": "$1,991.42"
  }
] 

Solution

  • The following solution is intended to be generic, efficient and as simple as possible subject to the first two objectives.

    Genericity

    For genericity, let us suppose that $one and $two are two arrays of JSON entities, and that we wish to find those items, $x, in $one such that ($x|filter) does not appear in map($two | filter), where filter is an arbitrary filter. (In the present instance, it is {surname, name}.)

    The solution uses INDEX/1, which was added to jq after the official 1.5 release, so we begin by reproducing its definition:

    def INDEX(stream; idx_expr):
      reduce stream as $row ({};
        .[$row|idx_expr|
          if type != "string" then tojson
          else .
          end] |= $row);
    def INDEX(idx_expr): INDEX(.[]; idx_expr);
    

    Efficiency

    For efficiency, we will need to use a JSON object as a dictionary; since keys must be strings, we will need to ensure that when converting an object to a string, the objects are normalized. For this, we define normalize as follows:

    # Normalize the input with respect to the order of keys in objects
    def normalize:
      . as $in
      | if type == "object" then reduce keys[] as $key
             ( {}; . + { ($key):  ($in[$key] | normalize) } ) 
        elif type == "array" then map( normalize )
        else .
        end;
    

    To construct the dictionary, we simply apply (normalize|tojson):

    def todict(filter):
      INDEX(filter| normalize | tojson);
    

    The solution

    The solution is now quite simple:

    # select those items from the input stream for which 
    # (normalize|tojson) is NOT in dict:
    def MINUS(filter; $dict):
     select( $dict[filter | normalize | tojson] | not);
    
    def difference($one; $two; filter):
      ($two | todict(filter)) as $dict
      | $one[] | MINUS( filter; $dict );
    
    difference( $one; $two; {surname, name} )
    

    Invocation

    $ jq -n --argfile one one.json --argfile two two.json -f difference.jq