jmespath : getting keys with property filter


I have the following JSON:

{
    "dataset_1": {
        "size_in_mb": 0.5,
        "task": "clean",
        "tags": ["apple", "banana", "strawberry"]
    },
    "dataset_2": {
        "size_in_mb": 100,
        "task": "split",
        "tags": ["apple"]
    },
    "dataset_3": {
        "size_in_mb": 1024,
        "task": "clean",
        "tags": ["strawberry"]
    }
}

How do I:

  1. get the datasets which have a tag called "apple"
  2. get the datasets which are larger than 500 MB
  3. get the datasets which have "split" as their task

I can query the properties of a dataset, but I cannot extract the names of the datasets that have a certain property. For example, I can get ["strawberry"], but not ["dataset_1", "dataset_3"], when "tags" contains "strawberry".

This question comes close, but basically says you can't use jmespath.


Solution

  • You figured this one out

    • As you stated in a comment, re-normalizing the original dataset so that the top level is a sequential array of records (instead of an object keyed by dataset name) is usually the best way to go if you want to run general-purpose queries with jmespath.

    • The Stack Overflow post that you linked to goes into a little more detail on that matter.
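The re-normalization described above can be sketched in a few lines of Python. This is only an illustration, not the asker's actual code: the helper name `normalize` is hypothetical, and the `dataroot` and `name` field names are taken from the "After" layout shown in this answer.

```python
def normalize(datasets):
    """Turn {"dataset_1": {...}, ...} into {"dataroot": [{"name": "dataset_1", ...}, ...]}.

    Hypothetical helper: each top-level key becomes a "name" field on its record,
    so that the dataset name is ordinary data a query can select.
    """
    return {
        "dataroot": [
            {"name": name, **props}  # keep the former key alongside its properties
            for name, props in datasets.items()
        ]
    }


# The dataset from the question, keyed by dataset name.
original = {
    "dataset_1": {"size_in_mb": 0.5, "task": "clean", "tags": ["apple", "banana", "strawberry"]},
    "dataset_2": {"size_in_mb": 100, "task": "split", "tags": ["apple"]},
    "dataset_3": {"size_in_mb": 1024, "task": "clean", "tags": ["strawberry"]},
}

normalized = normalize(original)
```

The point of the change is that the dataset name stops being a key and becomes a regular field, so it can be projected like any other property.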

    Before and After re-normalizing the dataset

    • For the benefit of those who may want more detail on what you meant when you said "I ended up changing the schema a little ...", here is a "before and after" example of what that can look like.

    Before

      {
          "dataset_1": {
              "size_in_mb": 0.5,
              "task": "clean",
              "tags": ["apple", "banana", "strawberry"]
          },
          "dataset_2": {
              "size_in_mb": 100,
              "task": "split",
              "tags": ["apple"]
          },
          "dataset_3": {
              "size_in_mb": 1024,
              "task": "clean",
              "tags": ["strawberry"]
          }
      }
    

    After

      {"dataroot":[
          {
              "name":      "dataset_1",
              "size_in_mb": 0.5,
              "task": "clean",
              "tags": ["apple", "banana", "strawberry"]
          },
          {
              "name":      "dataset_2",
              "size_in_mb": 100,
              "task": "split",
              "tags": ["apple"]
          },
          {
              "name":      "dataset_3",
              "size_in_mb": 1024,
              "task": "clean",
              "tags": ["strawberry"]
          }
      ]}