Search code examples
elasticsearchelasticsearch-5kibana-5

Elasticsearch: get top nested doc per month without top level duplicates


I have some time-based, nested data of which I would like to get the biggest changes, positive and negative, of plugins per month. I work with Elasticsearch 5.3 (and Kibana 5.3).

A document is structured as follows:

   {
       _id: "xxx",
       @timestamp: 1508244365987,
       siteURL: "www.foo.bar",
       plugins: [
           {
               name: "foo",
               version: "3.1.4"
           },
           {
               name: "baz",
               version: "13.37"
           }
      ]
   }

However, per id (siteURL), I have multiple entries per month and I would like to use only the latest per time bucket, to avoid unfair weighing. I tried to solve this by using the following aggregation:

{
  "aggs": {
    "normal_dates": {
      "date_range": {
        "field": "@timestamp",
        "ranges": [
          {
            "from": "now-1y/d",
            "to": "now"
          }
        ]
      },
      "aggs": {
        "date_histo": {
          "date_histogram": {
          "field": "@timestamp",
          "interval": "month"
        },
        "aggs": {
          "top_sites": {
            "terms": {
              "field": "siteURL.keyword",
              "size": 50000
              },
            "aggs": {
              "top_plugin_hits": {
                "top_hits": {
                  "sort": [
                    {
                      "@timestamp": {
                        "order": "desc"
                      }
                    }
                  ],
                  "_source": {
                    "includes": [
                      "plugins.name"
                    ]
                  },
                  "size": 1
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Now I get per month the latest site and its plugins. Next I would like to turn the data inside out and get the plugins present per month and a count of the occurrences. Then I would use a serial_diff to compare months.

However, I don't know how to go from my aggregation to the serial diff, i.e. turn the data inside out.

Any help would be most welcome

PS: extra kudos if I can get it in a Kibana 5.3 table...


Solution

  • It turns out it is not possible to further aggregate on a top_hits query.

    I ended up loading the results of the posted query into Python and used Python for further processing and visualization.