Search code examples
elasticsearchelasticsearch-dsl

Elasticsearch: How to get list of devices that haven't communicated recently?


I'm trying to get a list of devices that have not communicated in the last 30 minutes. I think my search needs to be:

  1. Separate index by name of device.
  2. Get latest record for each device.
  3. List records that are older than 30 mins.

The first two I can achieve by collapsing the index by the enum of the device Name, then sorting them by time and only returning the latest:

GET /<index>/_search
{
  "collapse": {
    "field": "Name.enum",
    "inner_hits": {
      "name": "most_recent",
      "size": 1,
      "sort": [{
        "Time": {
          "order": "desc"
        }
      }]
    }
  }
}

But now want to query which of these is over 30 mins old:

  "query": {
    "bool": {
      "must_not": {
        "range": {
          "Time": {
            "gt": "now-35m"
          }
        }
      }
    }
  }

How can I run the query on the result of the sort after the collapse?


Solution

  • I would do it differently, not by using field collapsing, but using a terms aggregation instead, which gives you more flexibility.

    In the aggregation query below, you can see that we're aggregating the devices names, then we're taking the most recent record (using a max metric aggregation) and finally, we're weeding out all buckets whose most recent record is older than 35 minutes ago by leveraging the bucket_selector pipeline aggregation:

    POST <index>/_search
    {
      "size": 0,
      "aggs": {
        "devices": {
          "terms": {
            "field": "Name.enum"
          },
          "aggs": {
            "most_recent": {
              "max": {
                "field": "Time"
              }
            },
            "older_than_30_min": {
              "bucket_selector": {
                "buckets_path": {
                  "most_recent": "most_recent"
                },
                "script": "params.most_recent > new Date().getTime() - 35 * 60000"
              }
            }
          }
        }
      }
    }
    

    If you're on recent versions (8.11+), you can also leverage the newly released and very promising ES|QL language which makes that kind of queries much easier:

    POST _query
    {
      "query": """
      from index
      | stats most_recent = max(Time) by Name
      | where most_recent > NOW() - 35 minutes
      | keep Name, most_recent
      | limit 100
      """
    }
    

    The results look something like this

    {
      "columns": [
        {
          "name": "Name",
          "type": "keyword"
        },
        {
          "name": "most_recent",
          "type": "date"
        }
      ],
      "values": [
        [
          "John Doe",
          "2024-04-13T04:25:10.000Z"
        ]
      ]
    }