Search code examples
pythonelasticsearchelasticsearch-aggregationelasticsearch-py

Average bucket aggregation including 0 values


I am using ES to get statistics per day in a case where I do not have input data for everyday, but where missing data is also an information as such (in my case I am measuring an activity). To achieve such task I use the average bucket aggregation and I set the min_doc_count to 0 in order to have a value even when my data is missing. My query is the following :

query = {
 "size": 0,
   "query": {
        "bool" : {
            "should" :{
                "match" : { "LogEntryType" : "LogEntry" }    
            },
            "must" : {
                'match' : {'Id' : id}
            }
        }

    },
  "aggs": {
    "actions_per_day": {
      "date_histogram": {
        "field": "CreationTime",
        "interval": "day",
          "min_doc_count" : 0
      },
      "aggs": {
        "amount": {
          "value_count": {
            "field": "CreationTime"
          }
        }
      }
    },
    "avg_daily_action": {
      "avg_bucket": {
        "buckets_path": "actions_per_day>amount"
      }
    }
  }
 }
 es1.search(index=indx1, body=query)

And it does return objects with a value set to 0, as you can see in the following extract of the response :

{'took': 1,
'timed_out': False,
'_shards': {'total': 2, 'successful': 2, 'failed': 0},
'hits': {'total': 4321, 'max_score': 0.0, 'hits': []},
'aggregations': {'actions_per_day': {'buckets': [
{'key_as_string': '2018-07-13T00:00:00.000Z',
 'key': 1531440000000,
 'doc_count': 631,
 'amount': {'value': 631}},
{'key_as_string': '2018-07-14T00:00:00.000Z',
 'key': 1531526400000,
 'doc_count': 0,
 'amount': {'value': 0}}...

But my problem is that for the average daily action, the computation does not consider the 0 values, it returns :

...'avg_daily_action': {'value': 432.1}

Which corresponds to the average not taking into account the days when no input was given. I have found this issue, where it is mentioned that I should use the "missing" parameter, however I think that it is to retrieve elements with 0, and I already have them. Any clues on how to take them into account?


Solution

  • It turns out I was almost there. One must use the gap_policy parameter in the average bucket aggregation to take into account the 0 values. The average bucket part now becomes

     "avg_daily_operations": {
          "avg_bucket": {
            "buckets_path": "modifications_per_day>amount",
              "gap_policy" : "insert_zeros"
          }
     }
    

    The default value is set to skip, and changing it solves the issue.