elasticsearch, elastic-stack

How to sum the size of documents within a time interval?


I'm trying to estimate the total size of n documents in an index using the query below:

GET /events/_search
{
  "query": {
    "bool": {
      "must": [
        {"range": {"ts": {"gte": "2022-10-10T00:00:00Z", "lt": "2022-10-21T00:00:00Z"}}}
      ]
    }
  },
  "aggs": {
    "total_size": {
      "sum": {
        "field": "doc['_source'].bytes"
      }
    }
  }
}

This returns documents, but the value of the aggregation is 0:

  "aggregations" : {
    "total_size" : {
      "value" : 0.0
    }
  }

How can I sum the size of documents within a time interval?


Solution

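  • The aggregation returns 0 because "field" expects the name of a mapped field; doc['_source'].bytes is treated as a literal field name that does not exist, so there is nothing to sum. The best way to achieve what you want is to add another field that contains the real source size at indexing time.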

    However, if you only want to run it once to see what it looks like, you can use runtime fields to compute the size at search time. Be aware that this can put a heavy load on your cluster. Since the Painless scripting language doesn't yet provide a way to turn the source document back into the same JSON you sent at indexing time, we can only approximate the value you're looking for by stringifying the _source HashMap. The script below multiplies the character count by 8, so the result is roughly a size in bits (assuming one byte per character); drop the multiplication if you want bytes:

    GET /events/_search
    {
      "runtime_mappings": {
        "source.size": {
          "type": "double",
          "script": """
            def size = params._source.toString().length() * 8;
            emit(size);
          """
        }
      },
      "query": {
        "bool": {
          "must": [
            {"range": {"ts": {"gte": "2022-10-10T00:00:00Z", "lt": "2022-10-21T00:00:00Z"}}}
          ]
        }
      },
      "aggs": {
        "size": {
          "sum": {
            "field": "source.size"
          }
        }
      }
    }
    

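    Another option is to install the mapper-size plugin. Once _size is enabled in the index mapping, the plugin stores the size in bytes of the original _source at indexing time, and you can run the sum aggregation directly on the _size field.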
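
To illustrate the first suggestion (storing the size at indexing time), here is a minimal Python sketch of what a client could do before sending each document. The field name source_size_bytes and the helpers with_source_size and sum_size_query are hypothetical names chosen for this example; they are not part of Elasticsearch or of the original answer. The byte count is also only an approximation of what is actually sent, since it depends on how your client serializes the document.

```python
import json

# Hypothetical field name used for illustration; pick whatever fits your mapping.
SIZE_FIELD = "source_size_bytes"


def with_source_size(doc: dict) -> dict:
    """Return a copy of doc with its serialized UTF-8 size in bytes added."""
    # Compact separators approximate a minimal JSON body; the real request
    # body may differ slightly depending on the client's serializer.
    payload = json.dumps(doc, separators=(",", ":"), ensure_ascii=False)
    enriched = dict(doc)
    enriched[SIZE_FIELD] = len(payload.encode("utf-8"))
    return enriched


def sum_size_query(gte: str, lt: str) -> dict:
    """Build a search body that sums the stored size over a range on ts."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {"range": {"ts": {"gte": gte, "lt": lt}}},
        "aggs": {"total_size": {"sum": {"field": SIZE_FIELD}}},
    }


# Enrich a sample document, then build the aggregation request body.
doc = with_source_size({"ts": "2022-10-12T08:00:00Z", "msg": "héllo"})
body = sum_size_query("2022-10-10T00:00:00Z", "2022-10-21T00:00:00Z")
print(doc[SIZE_FIELD])
print(json.dumps(body, indent=2))
```

The enriched document would be indexed as usual, and the printed body can be sent to GET /events/_search. Because the size is a regular numeric field, the sum aggregation needs no script and avoids the search-time cost of the runtime field approach.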