Search code examples
elasticsearchelasticsearch-aggregationelasticsearch-dsl

Elasticsearch - Return a subset of nested results


Elasticsearch 7.7 and I'm using the official php client to interact with the server.

My issue was somewhat solved here: https://discuss.elastic.co/t/need-to-return-part-of-a-doc-from-a-search-query-filter-is-parent-child-the-way-to-go/64514/2

However "Types are deprecated in APIs in 7.0+" https://www.elastic.co/guide/en/elasticsearch/reference/7.x/removal-of-types.html

Here is my document:

{
  "offering_id": "1190",
  "account_id": "362353",
  "service_id": "20087",
  "title": "Quick Brown Mammal",
  "slug": "Quick Brown Fox",
  "summary": "Quick Brown Fox"
  "header_thumb_path": "uploads/test/test.png",
  "duration": "30",
  "alter_ids": [
    "59151",
    "58796",
    "58613",
    "54286",
    "51812",
    "50052",
    "48387",
    "37927",
    "36685",
    "36554",
    "28807",
    "23154",
    "22356",
    "21480",
    "220",
    "1201",
    "1192"
  ],
  "premium": "f",
  "featured": "f",
  "events": [
    {
      "event_id": "9999",
      "start_date": "2020-07-01 14:00:00",
      "registration_count": "22",
      "description": "boo"
    },
    {
      "event_id": "9999",
      "start_date": "2020-07-01 14:00:00",
      "registration_count": "22",
      "description": "xyz"
    },
    {
      "event_id": "9999",
      "start_date": "2020-08-11 11:30:00",
      "registration_count": "41",
      "description": "test"
    }
  ]
}

Notice how the object may have one or many "events"

Searching based on event data is the most common use case.

For example:

  • Find events that start before 12pm
  • Find events with a description of "xyz"
  • List find events with a start date in the next 10 days.

I would like to NOT return any events that didn't match the query!


So, for example Find events with a description of "xyz" for a given service

{
  "query": {
    "bool": {
      "must": {
        "match": {
          "events.description": "xyz"
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "service_id": 20087
              }
            }
          ]
        }
      }
    }
  }
}

I would want the result to look like this:

{
  "offering_id": "1190",
  "account_id": "362353",
  "service_id": "20087",
  "title": "Quick Brown Mammal",
  "slug": "Quick Brown Fox",
  "summary": "Quick Brown Fox"
  "header_thumb_path": "uploads/test/test.png",
  "duration": "30",
  "alter_ids": [
    "59151",
    "58796",
    "58613",
    "54286",
    "51812",
    "50052",
    "48387",
    "37927",
    "36685",
    "36554",
    "28807",
    "23154",
    "22356",
    "21480",
    "220",
    "1201",
    "1192"
  ],
  "premium": "f",
  "featured": "f",
  "events": [
    {
      "event_id": "9999",
      "start_date": "2020-07-01 14:00:00",
      "registration_count": "22",
      "description": "xyz"
    }
  ]
}

However, instead it just returns the ENTIRE document, with all events.

Is it even possible to return only a subset of the data? Maybe with Aggregations?

  • Right now, we're doing an "extra" set of filtering on the result set in the application (php in this case) to strip out event blocks that don't match the desired results.
  • It would be nice to just have elastic give directly what's needed instead of doing extra processing on the result to pull out the applicable event.
  • Thought about restructuring the data to instead have it based around "events" but then I would be duplicating data since every offering will have the parent data too.

This used to be in SQL, where there was a relation instead of having the data nested like this.


Solution

  • A subset of the nested data can be returned using Nested Aggregations along with Filter Aggregations

    To know more about these aggregations refer these official documentation :

    Filter Aggregation

    Nested Aggregation

    Index Mapping:

    {
      "mappings": {
        "properties": {
          "offering_id": {
            "type": "integer"
          },
          "account_id": {
            "type": "integer"
          },
          "service_id": {
            "type": "integer"
          },
          "title": {
            "type": "text"
          },
          "slug": {
            "type": "text"
          },
          "summary": {
            "type": "text"
          },
          "header_thumb_path": {
            "type": "keyword"
          },
          "duration": {
            "type": "integer"
          },
          "alter_ids": {
            "type": "integer"
          },
          "premium": {
            "type": "text"
          },
          "featured": {
            "type": "text"
          },
          "events": {
            "type": "nested",
            "properties": {
              "event_id": {
                "type": "integer"
              },
              "registration_count": {
                "type": "integer"
              },
              "description": {
                "type": "text"
              }
            }
          }
        }
      }
    }
    

    Search Query :

    {
      "size": 0,
      "aggs": {
        "nested": {
          "nested": {
            "path": "events"
          },
          "aggs": {
            "filter": {
              "filter": {
                "match": { "events.description": "xyz" }
              },
              "aggs": {
                "total": {
                  "top_hits": {
                    "size": 10
                  }
                }
              }
            }
          }
        }
      }
    }
    

    Search Result :

    "hits": [
              {
                "_index": "foo21",
                "_type": "_doc",
                "_id": "1",
                "_nested": {
                  "field": "events",
                  "offset": 1
                },
                "_score": 1.0,
                "_source": {
                  "event_id": "9999",
                  "start_date": "2020-07-01 14:00:00",
                  "registration_count": "22",
                  "description": "xyz"
                }
              }
            ]
    

    Second Method :

    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "service_id": "20087"
              }
            },
            {
              "nested": {
                "path": "events",
                "query": {
                  "bool": {
                    "must": [
                      {
                        "match": {
                          "events.description": "xyz"
                        }
                      }
                    ]
                  }
                },
                "inner_hits": {
                  
                }
              }
            }
          ]
        }
      }
    }
    

    You can even go through this SO answer:

    How to filter nested aggregation bucket?

    Returning a partial nested document in ElasticSearch