elasticsearch

Further aggregate date histogram in Elasticsearch query


I'm trying to extract, for each event_type, in how many days the event occurred and the last occurrence. Currently I wrote the following query:

{
    "size": 0,
    "query": { ... },
    "_source": false,
    "aggregations": {
        "events": { 
            "terms": { "field": "event_type" },
            "aggs": {
                "event_timestamp": {
                    "date_histogram": {
                        "field": "@timestamp",
                        "calendar_interval": "day",
                        "min_doc_count": 1,
                        "order": { "_key": "desc" }
                    }
                }
            }
        }
    }
}

that extracts data to be further transformed. In my case, I just need, for each event_type, the key of the first event_timestamp bucket and the number of event_timestamp buckets. This can be done easily in a second step after this extraction, but it would be cleaner to let Elastic do it.

Max bucket (or metric) aggregation does not seems helpful, as I don't see a suitable buckets_path in the date_histogram aggregation.


Solution

  • Good start!!

    Here is a query that returns you the latest timestamp per event_type as well as the number of daily buckets for each event_type (for this you can use the bucket_script pipeline aggregation which simply returns the number of buckets in the date_histogram aggregation).

    Also note that the filter_path query string parameter allows you to only return what matters to you, i.e. you can get rid of the date histogram that you're not interested in:

    POST test/_search?filter_path=**.latest_timestamp.hits.hits._source,**.bucket_count,**.events.buckets.key
    {
      "size": 0,
      "_source": false,
      "query": {...},
      "aggregations": {
        "events": {
          "terms": {
            "field": "event_type"
          },
          "aggs": {
            "latest_timestamp": {
              "top_hits": {
                "size": 1,
                "sort": {"@timestamp": "desc"}, 
                "_source": ["@timestamp"]
              }
            },
            "event_timestamp": {
              "date_histogram": {
                "field": "@timestamp",
                "calendar_interval": "day",
                "min_doc_count": 1,
                "order": {
                  "_key": "desc"
                }
              }
            },
            "bucket_count": {
              "bucket_script": {
                "buckets_path": {
                  "count": "event_timestamp._bucket_count"
                },
                "script": "params.count"
              }
            }
          }
        }
      }
    }
    

    You'll get for each event type something that looks like this:

        {
          "key": "2",                            <--- event type
          "latest": {
            "hits": {
              "hits": [
                {
                  "_source": {
                    "@timestamp": "2023-01-05"   <--- latest timestamp
                  }
                }
              ]
            }
          },
          "bucket_count": {
            "value": 4                           <--- bucket count
          }
        },