elasticsearch, elasticsearch-aggregation, elasticsearch-scripting

Elasticsearch: Aggregation For Random Fields



I have a document like the one below. Its structure is a "contents" field containing many random keys (note that there is no fixed format for the keys; they may just be UUIDs). I want to find the maximum value of start_time across all keys in "contents" with an ES query. How can I do this? The document:

{"contents": {
    "key1": {
        "start_time": "2020-08-01T00:00:19.500Z",
        "last_event_published_time": "2020-08-01T23:59:03.738Z",
        "last_event_timestamp": "2020-08-01T23:59:03.737Z",
        "size": 1590513,
        "read_offset": 1590513,
        "name": "key1_name"
    },
    "key2": {
        "start_time": "2020-08-01T00:00:19.500Z",
        "last_event_published_time": "2020-08-01T23:59:03.738Z",
        "last_event_timestamp": "2020-08-01T23:59:03.737Z",
        "size": 1590513,
        "read_offset": 1590513,
        "name": "key2_name"
    }
}}

I have tried Joe's solution and it works. But when I modify the document like this:

{
  "timestamp": "2020-08-01T23:59:59.359Z",
  "type": "beats_stats",
  "beats_stats": {
    "metrics": {
      "filebeat": {
        "harvester": {
          "files": {
            "d47f60db-ac59-4b51-a928-0772a815438a": {
              "start_time": "2020-08-01T00:00:18.320Z",
              "last_event_published_time": "2020-08-01T23:59:03.738Z",
              "last_event_timestamp": "2020-08-01T23:59:03.737Z",
              "size": 1590513,
              "read_offset": 1590513,
              "name": "/data/logs/galogs/ga_log_2020-08-01.log"
            },
            "e47f60db-ac59-4b51-a928-0772a815438a": {
              "start_time": "2020-08-01T00:00:19.500Z",
              "last_event_published_time": "2020-08-01T23:59:03.738Z",
              "last_event_timestamp": "2020-08-01T23:59:03.737Z",
              "size": 1590513,
              "read_offset": 1590513,
              "name": "/data/logs/galogs/ga_log_2020-08-01.log"
            }
          }
        }
      }
    }
  }
}

It fails with:

"error" : {
"root_cause" : [
  {
    "type" : "script_exception",
    "reason" : "runtime error",
    "script_stack" : [
      "for (def entry : params._source['beats_stats.metrics.filebeat.harvester.files'].values()) {\n            ",
      "                                                                               ^---- HERE"
    ],
    "script" : "\n          for (def entry : params._source['beats_stats.metrics.filebeat.harvester.files'].values()) {\n            state.start_millis_arr.add(\n              Instant.parse(entry.start_time).toEpochMilli()\n            );\n          }\n        ",
    "lang" : "painless"
  }
],
"type" : "search_phase_execution_exception",
"reason" : "all shards failed",
"phase" : "query",
"grouped" : true,
"failed_shards" : [
  {
    "shard" : 0,
    "index" : "agg-test-index-1",
    "node" : "B4mXZVgrTe-MsAQKMVhHUQ",
    "reason" : {
      "type" : "script_exception",
      "reason" : "runtime error",
      "script_stack" : [
        "for (def entry : params._source['beats_stats.metrics.filebeat.harvester.files'].values()) {\n            ",
        "                                                                               ^---- HERE"
      ],
      "script" : "\n          for (def entry : params._source['beats_stats.metrics.filebeat.harvester.files'].values()) {\n            state.start_millis_arr.add(\n              Instant.parse(entry.start_time).toEpochMilli()\n            );\n          }\n        ",
      "lang" : "painless",
      "caused_by" : {
        "type" : "null_pointer_exception",
        "reason" : null
      }
    }
  }
]}

Solution

  • You can use a scripted_metric aggregation to calculate this. It's quite onerous but certainly possible.

    Mimicking your index & indexing a few docs:

    POST myindex/_doc
    {"contents":{"randomKey1":{"start_time":"2020-08-06T11:01:00.515Z"}}}
    
    POST myindex/_doc
    {"contents":{"35431fsf31_s35dfas":{"start_time":"2021-08-06T11:01:00.515Z"}}}
    
    POST myindex/_doc
    {"contents":{"999bc_123":{"start_time":"2019-08-06T11:01:00.515Z"}}}
    

    Get the max date of unknown random sub-objects:

    GET myindex/_search
    {
      "size": 0,
      "aggs": {
        "max_start_date": {
          "scripted_metric": {
            "init_script": "state.start_millis_arr = [];",
            "map_script": """
              for (def entry : params._source['contents'].values()) {
                state.start_millis_arr.add(
                  Instant.parse(entry.start_time).toEpochMilli()
                );
              }
            """,
            "combine_script": """
              // sort in-place
              Collections.sort(state.start_millis_arr, Collections.reverseOrder());
              return DateTimeFormatter.ISO_INSTANT.format(
                Instant.ofEpochMilli(
                  // first is now the highest
                  state.start_millis_arr[0]
                )
              );
            """,
            "reduce_script": "return states"
          }
        }
      }
    }
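
    A caveat on the reduce_script above: it returns states as-is, i.e. one combined value per shard, so on a multi-shard index you would get a list of per-shard maxima rather than a single date. If you want one value, reduce across the shards -- a sketch (the ISO-8601 strings produced by the combine step share one format, so they compare correctly as plain strings):

    "reduce_script": """
      // pick the overall maximum across the per-shard results
      def max = null;
      for (def s : states) {
        if (s != null && (max == null || s.compareTo(max) > 0)) {
          max = s;
        }
      }
      return max;
    """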
    

    BTW: @Sahil Gupta's comment is right -- never use images where pasting the text is possible (and helpful).
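
    Update -- regarding the modified beats_stats document: params._source is a plain map of maps in Painless, so a bracket lookup with a single dotted string like params._source['beats_stats.metrics.filebeat.harvester.files'] looks for a literal key containing the dots, finds nothing, and returns null -- that is the null_pointer_exception in the error above. Step through the nesting level by level instead. A minimal sketch, assuming every document contains the full beats_stats path (add null checks if some do not):

    GET agg-test-index-1/_search
    {
      "size": 0,
      "aggs": {
        "max_start_date": {
          "scripted_metric": {
            "init_script": "state.start_millis_arr = [];",
            "map_script": """
              // walk the nested maps step by step instead of using one dotted key
              def files = params._source.beats_stats.metrics.filebeat.harvester.files;
              for (def entry : files.values()) {
                state.start_millis_arr.add(
                  Instant.parse(entry.start_time).toEpochMilli()
                );
              }
            """,
            "combine_script": """
              // sort descending so the first element is the shard's maximum
              Collections.sort(state.start_millis_arr, Collections.reverseOrder());
              return DateTimeFormatter.ISO_INSTANT.format(
                Instant.ofEpochMilli(state.start_millis_arr[0])
              );
            """,
            "reduce_script": "return states"
          }
        }
      }
    }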