elasticsearch, elasticsearch-aggregation, elasticsearch-dsl

How do I aggregate slightly different data in Elasticsearch?


The following request calculates percentiles of the request duration for the endpoint /api/v1/blabla:

    POST /filebeat-nginx-*/_search
    {
      "aggs": {
        "hosts": {
          "terms": {
            "field": "host.name",
            "size": 1000
          },
          "aggs": {
            "url": {
              "terms": {
                "field": "nginx.access.url",
                "size": 1000
              },
              "aggs": {
                "time_duration_percentiles": {
                  "percentiles": {
                    "field": "nginx.access.time_duration",
                    "percents": [
                      50,
                      90
                    ],
                    "keyed": true
                  }
                }
              }
            }
          }
        }
      },
      "size": 0,
      "query": {
        "bool": {
          "filter": [
            {
              "bool": {
                "should": [
                  {
                    "prefix": {
                      "nginx.access.url": "/api/v1/blabla" 
                    }
                  }
                ]
              }
            },
            {
              "range": {
                "@timestamp": {
                  "gte": "now-10m",
                  "lte": "now" 
                }
              }
            }
          ]
        }
      }
    }

The problem is that query parameters are also passed to this endpoint, for example /api/v1/blabla?Lang=en&type=active or /api/v1/blabla/?Lang=en&type=history. As a result, the response shows percentiles separately for each such "distinct" URL:

    {
      "key" : "/api/v1/blabla?lang=ru",
      "doc_count" : 423,
      "time_duration_percentiles" : {
        "values" : {
          "50.0" : 0.21199999749660492,
          "90.0" : 0.29839999079704277
        }
      }
    },
    {
      "key" : "/api/v1/blabla?lang=en&type=active",
      "doc_count" : 31,
      "time_duration_percentiles" : {
        "values" : {
          "50.0" : 0.21699999272823334,
          "90.0" : 0.2510000020265579
        }
      }
    },
    {
      "key" : "/api/v1/blabla?lang=en",
      "doc_count" : 4,
      "time_duration_percentiles" : {
        "values" : {
          "50.0" : 0.22700000554323196,
          "90.0" : 0.24899999797344208
        }
      }
    }

Is it possible to aggregate these similar URLs into a single /api/v1/blabla bucket and get the overall percentiles?

Like this:

    {
      "key" : "/api/v1/blabla",
      "doc_count" : 4,
      "time_duration_percentiles" : {
        "values" : {
          "50.0" : 0.22700000554323196,
          "90.0" : 0.24899999797344208
        }
      }
    }

Solution

  • You could try splitting nginx.access.url in a script, but keep in mind that it will probably be slow:

    {
      "aggs": {
        "hosts": {
          "terms": {
            "field": "host.name",
            "size": 1000
          },
          "aggs": {
            "url": {
              "terms": {
                "script": {
                  "source": "/\\?/.split(doc['nginx.access.url'].value)[0]"       <--- here
                }, 
                "size": 1000
              },
              "aggs": {
                "time_duration_percentiles": {
                  "percentiles": {
                    "field": "nginx.access.time_duration",
                    "percents": [
                      50,
                      90
                    ],
                    "keyed": true
                  }
                }
              }
            }
          }
        }
      },
      ...
    }
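
    If regular expressions are not enabled for Painless on your cluster (this depends on the version and on the script.painless.regex.enabled setting), a regex-free variant of the same idea might look like the sketch below. It cuts the URL at the first "?" with indexOf/substring. The trailing-slash handling is an extra assumption of mine, added because the question also mentions /api/v1/blabla/?...; drop it if both forms should stay separate. The query part is copied from the question, only flattened.

    # Sketch only: same aggregation, with the bucket key computed without a regex
    POST /filebeat-nginx-*/_search
    {
      "size": 0,
      "query": {
        "bool": {
          "filter": [
            { "prefix": { "nginx.access.url": "/api/v1/blabla" } },
            { "range": { "@timestamp": { "gte": "now-10m", "lte": "now" } } }
          ]
        }
      },
      "aggs": {
        "hosts": {
          "terms": { "field": "host.name", "size": 1000 },
          "aggs": {
            "url": {
              "terms": {
                "script": {
                  "lang": "painless",
                  "source": "String u = doc['nginx.access.url'].value; int q = u.indexOf('?'); if (q >= 0) { u = u.substring(0, q); } if (u.length() > 1 && u.endsWith('/')) { u = u.substring(0, u.length() - 1); } return u;"
                },
                "size": 1000
              },
              "aggs": {
                "time_duration_percentiles": {
                  "percentiles": {
                    "field": "nginx.access.time_duration",
                    "percents": [50, 90],
                    "keyed": true
                  }
                }
              }
            }
          }
        }
      }
    }

    The script still runs once per matching document, so the performance caveat above applies here too. If only this one endpoint matters, removing the "url" level entirely and putting the percentiles directly under "hosts" should also give a single result per host, since the query already filters on the prefix.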
    

    BTW, it's good practice to extract the URI hostname, path, query string, etc. before you index your docs. You can do so with the URI parts processor in an ingest pipeline, among other mechanisms.
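
As a rough illustration of the index-time approach, an ingest pipeline built around the uri_parts processor could look like the sketch below. The pipeline name nginx-url-parts is made up for this example, and the processor is only available in reasonably recent Elasticsearch versions (7.12 or later, as far as I know). I am also assuming that it accepts relative URLs such as the ones nginx logs here, so it is worth checking with the _simulate call before wiring it into Filebeat.

    # Hypothetical pipeline: split nginx.access.url into url.path, url.query, etc.
    PUT _ingest/pipeline/nginx-url-parts
    {
      "description": "Split nginx.access.url into separate url.* fields",
      "processors": [
        {
          "uri_parts": {
            "field": "nginx.access.url",
            "target_field": "url",
            "keep_original": false,
            "ignore_failure": true
          }
        }
      ]
    }

    # Try the pipeline on a sample document without indexing anything
    POST _ingest/pipeline/nginx-url-parts/_simulate
    {
      "docs": [
        { "_source": { "nginx": { "access": { "url": "/api/v1/blabla?lang=en&type=active" } } } }
      ]
    }

If the simulated document comes back with url.path set to /api/v1/blabla, the "url" terms aggregation can use "field": "url.path" instead of a script, assuming that field is mapped as keyword in your index template. Only documents indexed after the pipeline is in place will have the new field.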