JMESPath: getting keys with a property filter

I have the following JSON:

{
    "dataset_1": {
        "size_in_mb": 0.5,
        "task": "clean",
        "tags": ["apple", "banana", "strawberry"]
    },
    "dataset_2": {
        "size_in_mb": 100,
        "task": "split",
        "tags": ["apple"]
    },
    "dataset_3": {
        "size_in_mb": 1024,
        "task": "clean",
        "tags": ["strawberry"]
    }
}

How do I:

get datasets which have a tag called "apple"
get datasets which are larger than 500 MB
get datasets which have task as "split"

I am able to query the properties of a dataset, but I am not able to extract the name of a dataset that has a certain property. For example, I can get ["strawberry"], but not ["dataset_1", "dataset_3"], when "tags" contains "strawberry".

This question comes close, but basically says you can't use jmespath.

Solution

You figured this one out

As you stated in a comment, re-normalizing the original dataset to use sequentially-enumerated collation (instead of object-keys for top-level collation) is usually the best way to go, if you want to do general-purpose queries with jmespath.
The Stack Overflow post that you linked to goes into a little more detail on that matter.
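
If the data already exists in the original keyed form, the re-normalization can be done mechanically before any querying. Here is a minimal Python sketch, assuming the whole document fits in memory. The helper name `renormalize` is made up for illustration, and the `dataroot` and `name` keys are simply the ones used in the "After" example further down.

```python
import json

# The question's original shape: dataset names are top-level object keys.
before = {
    "dataset_1": {"size_in_mb": 0.5, "task": "clean",
                  "tags": ["apple", "banana", "strawberry"]},
    "dataset_2": {"size_in_mb": 100, "task": "split", "tags": ["apple"]},
    "dataset_3": {"size_in_mb": 1024, "task": "clean", "tags": ["strawberry"]},
}


def renormalize(keyed, root_key="dataroot", name_key="name"):
    """Turn {name: {...}} into {root_key: [{name_key: name, ...}, ...]}.

    Hypothetical helper: each former object key becomes a regular
    property, so a filter can select on any property and still
    return the dataset name.
    """
    return {root_key: [{name_key: name, **props} for name, props in keyed.items()]}


after = renormalize(before)
print(json.dumps(after, indent=2))
```

The input is left untouched, so both shapes can be kept around if other code still depends on the keyed form.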

Before and After re-normalizing the dataset

For the benefit of those who may want more detail on what you meant when you said "I ended up changing the schema a little", here is a "before and after" example of what that can look like.

Before

  {
      "dataset_1": {
          "size_in_mb": 0.5,
          "task": "clean",
          "tags": ["apple", "banana", "strawberry"]
      },
      "dataset_2": {
          "size_in_mb": 100,
          "task": "split",
          "tags": ["apple"]
      },
      "dataset_3": {
          "size_in_mb": 1024,
          "task": "clean",
          "tags": ["strawberry"]
      }
  }

After

  {"dataroot":[
      {
          "name":      "dataset_1",
          "size_in_mb": 0.5,
          "task": "clean",
          "tags": ["apple", "banana", "strawberry"]
      },
      {
          "name":      "dataset_2",
          "size_in_mb": 100,
          "task": "split",
          "tags": ["apple"]
      },
      {
          "name":      "dataset_3",
          "size_in_mb": 1024,
          "task": "clean",
          "tags": ["strawberry"]
      }
  ]}