Tags: elasticsearch, date-histogram

Let a document match multiple buckets of a date histogram


I have an index with a mapping similar to this:

{
    "id": {
        "type": "long"
    },
    "start": {
        "type": "date"
    },
    "end": {
        "type": "date"
    }
}

I want to create a date histogram so that each document falls into all buckets whose intervals fall between "start" and "end".

E.g. if for one document "start" = 12/01/2018 and "end" = 04/25/2019, my date histogram interval is one week, and the range is now-1y until now, then I want the document to fall into every bucket from the week of 12/01/2018 until the week of 04/25/2019. So with just this one document, the result should be 52 buckets, where the buckets from April to December have doc_count 0 and the buckets from December to April have doc_count 1.

As I see it, date_histogram only gives me the option to match my document to exactly one bucket, based on a single field: either "start" or "end".
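
For reference, a plain date_histogram over a single date field looks roughly like the sketch below (the aggregation name "per_week" is just an illustration); each document is counted in exactly the one bucket that contains its "start" value:

{
  "size": 0,
  "aggs": {
    "per_week": {
      "date_histogram": {
        "field": "start",
        "interval": "week"
      }
    }
  }
}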

What I have tried so far:

  1. Dynamically generate a query with 52 filters, each of which checks whether a document falls into its "bucket"
  2. Make use of Painless scripts in each query (see the hypothetical sketch below)
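
For illustration, the script-based variant from attempt 2 might look like the sketch below; the bucket key, params, and epoch millisecond values are reconstructions, not the exact query. Each filter runs a Painless script against every document (assuming ES 6.x date doc values, where .millis yields epoch milliseconds), which explains why it scales so poorly:

{
  "size": 0,
  "aggs": {
    "last_seen_over_time": {
      "filters": {
        "filters": {
          "2018-09-24T00:00:00Z": {
            "script": {
              "script": {
                "source": "doc['start'].value.millis <= params.bucket_end && doc['end'].value.millis >= params.bucket_start",
                "params": {
                  "bucket_start": 1537142400000,
                  "bucket_end": 1537747200000
                }
              }
            }
          }
        }
      }
    }
  }
}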

Both solutions were extremely slow. I am working with around 200k documents, and such queries took around 10 seconds.

EDIT: Here is a sample query that is generated dynamically. As you can see, one filter is created per week. This query takes about 10 seconds, which is way too long.

%{
  aggs: %{
    count_chart: %{
      aggs: %{
        last_seen_over_time: %{
          filters: %{
            filters: %{
              "2018-09-24T00:00:00Z" => %{
                bool: %{
                  must: [
                    %{range: %{start: %{lte: "2018-09-24T00:00:00Z"}}},
                    %{range: %{end: %{gte: "2018-09-17T00:00:00Z"}}}
                  ]
                }
              },
              "2018-12-24T00:00:00Z" => %{
                bool: %{
                  must: [
                    %{range: %{start: %{lte: "2018-12-24T00:00:00Z"}}},
                    %{range: %{end: %{gte: "2018-12-17T00:00:00Z"}}}
                  ]
                }
              },
              "2019-04-01T00:00:00Z" => %{
                bool: %{
                  must: [
                    %{range: %{start: %{lte: "2019-04-01T00:00:00Z"}}},
                    %{range: %{end: %{gte: "2019-03-25T00:00:00Z"}}}
                  ]
                }
              }, ...
            }
          }
        }
      }
    }
  },
  size: 0
}

And a sample response:

%{
  "_shards" => %{"failed" => 0, "skipped" => 0, "successful" => 5, "total" => 5},
  "aggregations" => %{
    "count_chart" => %{
      "doc_count" => 944542,
      "last_seen_over_time" => %{
        "buckets" => %{
          "2018-09-24T00:00:00Z" => %{"doc_count" => 52212},
          "2018-12-24T00:00:00Z" => %{"doc_count" => 138509},
          "2019-04-01T00:00:00Z" => %{"doc_count" => 119634},
          ...
        }
      }
    }
  },
  "hits" => %{"hits" => [], "max_score" => 0.0, "total" => 14161812},
  "timed_out" => false,
  "took" => 2505
}

I hope this question is understandable. If not, I will explain it in more detail.


Solution

  • How about doing two date_histogram queries and calculating the difference per week? I'm assuming you just need the overall counts, given the size: 0 in your query.

        // Assumes a configured client from the legacy `elasticsearch` npm
        // package, e.g.:
        // const client = new (require('elasticsearch')).Client({ host: 'localhost:9200' });
        let start = await client.search({
            index: 'dates',
            size: 0,
            body: {
                "aggs": {
                    "start": {
                        "date_histogram": {
                            "field": "start",
                            "interval": "week"
                        }
                    }
                }
            }
        });

        let end = await client.search({
            index: 'dates',
            size: 0,
            body: {
                "aggs": {
                    "end": {
                        "date_histogram": {
                            "field": "end",
                            "interval": "week"
                        }
                    }
                }
            }
        });

        // A document is "active" in a week if it has started but not yet
        // ended, so walk the weeks in order, keeping running totals of
        // starts and ends; each bucket value is their difference.
        // This assumes both histograms return the same weeks in the same
        // order (see the note below).
        let buckets = {};
        let start_buckets = start.aggregations.start.buckets;
        let end_buckets = end.aggregations.end.buckets;
        let started = 0;
        let ended = 0;
        for (let i = 0; i < start_buckets.length; i++) {
            started += start_buckets[i].doc_count;
            buckets[start_buckets[i].key_as_string] = started - ended;
            // documents ending in week i still count as active during week i,
            // so this week's endings are added only after recording the bucket
            ended += end_buckets[i].doc_count;
        }
    

    This test took less than 2 seconds on my local machine at a similar scale to yours.
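
    Note that the merge loop assumes the two bucket arrays cover the same weeks in the same order; if the earliest "start" and the earliest "end" fall in different weeks, the arrays will be offset. You can pin both histograms to the same axis by setting extended_bounds (e.g. min: now-1y, max: now) on each date_histogram.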

    You can run both aggregations simultaneously to save more time.
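
    For example, both histograms can be fetched in one round trip (a sketch using the same hypothetical dates index and client as above), and the merge loop then reads both bucket arrays from the single response:

        let both = await client.search({
            index: 'dates',
            size: 0,
            body: {
                "aggs": {
                    "start": {
                        "date_histogram": { "field": "start", "interval": "week" }
                    },
                    "end": {
                        "date_histogram": { "field": "end", "interval": "week" }
                    }
                }
            }
        });
        // merge exactly as above, reading from
        // both.aggregations.start.buckets and both.aggregations.end.buckets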