Tags: elasticsearch, histogram, elasticsearch-5, elasticsearch-aggregation

Histogram not starting at the right min even with a range filter added


The Mapping

          "eventTime": {
            "type": "long"
          },

The Query

POST some_indices/_search
{
  "size": 0,
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "eventTime": {
            "from": 1563120000000,
            "to": 1565712000000,
            "format": "epoch_millis"
          }
        }
      }
    }
  },
  "aggs": {
    "min_eventTime": { "min": { "field": "eventTime" } },
    "max_eventTime": { "max": { "field": "eventTime" } },
    "time_series": {
      "histogram": {
        "field": "eventTime",
        "interval": 86400000,
        "min_doc_count": 0,
        "extended_bounds": {
          "min": 1563120000000,
          "max": 1565712000000
        }
      }
    }
  }
}

The Response

"aggregations": {
    "max_eventTime": {
      "value": 1565539199997
    },
    "min_eventTime": {
      "value": 1564934400000
    },
    "time_series": {
      "buckets": [
        {
          "key": 1563062400000,
          "doc_count": 0
        },
        {
          "key": 1563148800000,
          "doc_count": 0
        },
        {
        ...

Question

As the reference clearly mentions:

For filtering buckets, one should nest the histogram aggregation under a range filter aggregation with the appropriate from/to settings.

I set the filter properly (as the demo does), and the min and max aggregations also provide the evidence.

But why is the first key still SMALLER than the from (or min_eventTime)?

So weird, I'm totally lost now ;(

Any advice will be appreciated ;)

Solution

  • I hacked together a workaround for now, but I suspect it's a bug in Elasticsearch.

    I am using date_histogram instead, even though the field itself is of type long, and via its offset parameter I moved the starting point forward to the right timestamp:

      "aggs": {
        "time_series": {
          "date_histogram": {
            "field": "eventTime",
            "interval": 86400000,
            "offset": "+16h",
            "min_doc_count": 0,
            "extended_bounds": {
              "min": 1563120000000,
              "max": 1565712000000
            }
          },
          "aggs": {
            "order_amount_total": {
              "sum": {
                "field": "order_amount"
              }
            }
          }
        }
      }
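
    A note on the "+16h" offset: the desired min 1563120000000 sits 57600000 ms (exactly 16 hours) past the previous day boundary 1563062400000, which is what the offset compensates for; the test in the update below verifies the computation.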
    

    Updated

    Thanks to @Val for the help; I thought about it again and wrote the following test:

        @Test
        public void testComputation() {
            // The requested min is NOT aligned to the 86400000 ms (1 day) interval:
            System.out.println(1563120000000L % 86400000L); // 57600000 (= 16 hours)
            // The first key returned by the histogram IS aligned:
            System.out.println(1563062400000L % 86400000L); // 0
        }
    

    I want to quote from the docs:

    With extended_bounds setting, you now can "force" the histogram aggregation to start building buckets on a specific min value and also keep on building buckets up to a max value (even if there are no documents anymore). Using extended_bounds only makes sense when min_doc_count is 0 (the empty buckets will never be returned if min_doc_count is greater than 0).

    But I believe the specific min value should be a multiple of the interval (0, interval, 2 * interval, 3 * interval, ...) rather than an arbitrary value like the one I used in the question.
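
    To make that concrete, below is a minimal standalone sketch (plain Java; bucketKey is a hypothetical helper, not an Elasticsearch API) of the rounding rule the docs give for the histogram, bucket_key = floor((value - offset) / interval) * interval + offset:

        public class BucketKeySketch {
            // Hypothetical helper mirroring the documented key rounding:
            // bucket_key = floor((value - offset) / interval) * interval + offset
            static long bucketKey(long value, long interval, long offset) {
                return Math.floorDiv(value - offset, interval) * interval + offset;
            }

            public static void main(String[] args) {
                long interval = 86400000L; // one day in milliseconds
                // Without an offset the requested min is rounded down to the
                // day-aligned key, exactly the first key in the response:
                System.out.println(bucketKey(1563120000000L, interval, 0L));        // 1563062400000
                // With offset 57600000 (16 h) the key lines up with the min:
                System.out.println(bucketKey(1563120000000L, interval, 57600000L)); // 1563120000000
            }
        }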

    So basically, in my case I can use the offset of the plain histogram to solve the issue as follows; I don't actually need date_histogram at all.

           "histogram": {
              "field": "eventTime",
              "interval": 86400000, 
              "offset": 57600000,
              "min_doc_count" : 0,
              "extended_bounds": {            
                "min": 1563120000000,
                "max": 1565712000000
              }
            }
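
    With this offset, the first bucket key should be exactly 1563120000000, matching extended_bounds.min (per the key computation sketched above).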
    

    A clear explanation posted by Elasticsearch team member @polyfractal (thank you for the crystal-clear, detailed explanation) confirms the same logic; more details can be found here.

    The reasoning behind the design, which I want to quote here:

    if we cut the aggregation off right at the extended_bounds.min/max, we would generate buckets that are not the full interval and that would break many assumptions about how the histogram works.