Tags: java, json, elasticsearch, abstract-syntax-tree

How to identify expensive queries from the Query DSL?


I have a requirement in my application: identify expensive Elasticsearch queries.

I only know that Elasticsearch has a Query DSL (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html).

I need to identify each Elasticsearch query in the reverse proxy for Elasticsearch (the reverse proxy is developed in Java, just to throttle the requests to ES and collect some user statistics). If a query is expensive, only limited users may run it, at a specific rate limit.

What is difficult for me is how to identify the expensive queries. I know that Elasticsearch has a switch (the search.allow_expensive_queries setting) that can enable or disable expensive queries. I read the Elasticsearch source code, but I cannot find how Elasticsearch identifies the different kinds of expensive queries.

If you know:

  1. Is there any Elasticsearch API (from the Elasticsearch client SDK) that can identify expensive queries? Then I could invoke that API directly in my application.
  2. If not, what is an effective way to identify expensive queries by analyzing the query body? With some AST (Abstract Syntax Tree) resolver? Or by searching for specific keywords in the query body?

I'd really appreciate some help on this!


Solution

  • There isn't a good 'native' way to do it in Elasticsearch, but you do have some options that might help.

    Setting timeout or terminate_after

    This option looks at your requirement from a different perspective.

    From Elasticsearch docs: search-your-data

    You could keep a record of how long each query performed by the user took, by looking at the took field returned in the result.

    {
      "took": 5,
      "timed_out": false,
    ...
    }
    

    This way you have a record of how many queries a user performed in a given time window that were 'expensive' (took more than X).
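    For example, in a Java reverse proxy that already reads the response body and uses Jackson, recording took per user could look roughly like the sketch below. The 500 ms threshold and the class/method names are illustrative assumptions, not anything defined by Elasticsearch.

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;

    public class TookTracker {

        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Illustrative threshold: responses slower than this count as "expensive".
        private static final long EXPENSIVE_THRESHOLD_MS = 500;

        // Per-user count of expensive queries observed by the proxy.
        private final Map<String, LongAdder> expensiveCounts = new ConcurrentHashMap<>();

        /** Call this with the raw _search response body after forwarding it to the user. */
        public void record(String userId, String responseBody) throws Exception {
            JsonNode root = MAPPER.readTree(responseBody);
            long tookMs = root.path("took").asLong(-1);
            if (tookMs >= EXPENSIVE_THRESHOLD_MS) {
                expensiveCounts.computeIfAbsent(userId, u -> new LongAdder()).increment();
            }
        }

        /** How many expensive queries this user has run so far (window reset is left out of this sketch). */
        public long expensiveCount(String userId) {
            LongAdder counter = expensiveCounts.get(userId);
            return counter == null ? 0 : counter.sum();
        }
    }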

    For that user, you can start adding the timeout or terminate_after params, which try to limit query execution. This won't prevent the user from sending an expensive query, but it will try to cancel long-running queries after the timeout has expired, returning a partial or empty result to the user.

    GET /my-index-000001/_search
    {
      "timeout": "2s",
      "query": {
        "match": {
          "user.id": "kimchy"
        }
      }
    }
    

    This limits the effect of that user's expensive queries on the cluster.

    A side note: this Stack Overflow answer states that certain queries, such as script queries, can still bypass the timeout/terminate_after flags.

    terminate_after limits the number of documents searched on each of the shards. It can be used as an alternative, or as a backup if timeout is too high or ignored for some reason.
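    If the proxy rewrites request bodies with Jackson, injecting those limits could look roughly like this sketch. The 2s timeout and the 100,000-document terminate_after are arbitrary example values you would tune yourself, and the assumption is that the request body is a JSON object.

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.node.ObjectNode;

    public class QueryLimiter {

        private static final ObjectMapper MAPPER = new ObjectMapper();

        /**
         * Adds execution limits to a _search request body before forwarding it,
         * unless the caller already set them. The values are illustrative.
         */
        public static String addLimits(String requestBody) throws Exception {
            ObjectNode root = (ObjectNode) MAPPER.readTree(requestBody);
            if (!root.has("timeout")) {
                root.put("timeout", "2s");              // cancel after 2 seconds, best effort
            }
            if (!root.has("terminate_after")) {
                root.put("terminate_after", 100_000);   // max documents collected per shard
            }
            return MAPPER.writeValueAsString(root);
        }
    }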

    Long term analytics

    This answer probably requires a lot more work, but you could save statistics on the queries performed and the amount of time they took.

    You should probably use the JSON representation of the Query DSL in this case: save it in an Elasticsearch index along with the time that query took, and keep aggregates of the average time similar queries take.

    You could possibly use the rollup feature to pre-aggregate all the averages and check a query against this index to decide whether it is a "possibly expensive query".

    The problem here is which part of the query to save and which queries are "similar" enough to be considered for this aggregation.
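    As a rough illustration of that bookkeeping, a sketch using the Elasticsearch low-level Java REST client might index one document per observed query into a hypothetical query-stats index. The field names and the hashCode-based "similarity" key are assumptions you would replace with something smarter.

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.node.ObjectNode;
    import org.elasticsearch.client.Request;
    import org.elasticsearch.client.RestClient;

    import java.time.Instant;

    public class QueryStatsRecorder {

        private static final ObjectMapper MAPPER = new ObjectMapper();
        private final RestClient restClient;

        public QueryStatsRecorder(RestClient restClient) {
            this.restClient = restClient;
        }

        /** Stores one observation: who ran which query body and how long it took. */
        public void record(String userId, String queryBody, long tookMs) throws Exception {
            ObjectNode doc = MAPPER.createObjectNode();
            doc.put("user_id", userId);
            doc.put("query", queryBody);                  // raw Query DSL as a string
            doc.put("query_hash", queryBody.hashCode());  // crude stand-in for "similar queries"
            doc.put("took_ms", tookMs);
            doc.put("timestamp", Instant.now().toString());

            Request request = new Request("POST", "/query-stats/_doc"); // hypothetical stats index
            request.setJsonEntity(MAPPER.writeValueAsString(doc));
            restClient.performRequest(request);
        }
    }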

    Searching for keywords in the query

    You stated this as an option as well. The DSL query in the end translates to a REST call with a JSON body, so using a JsonNode you could look for specific sub-elements that you 'think' will make the query expensive, and even limit things like the number of buckets, etc.

    Using ObjectMapper you could write the query into a string and just look for keywords; this would be the easiest solution.

    There are specific features that we know require a lot of resources from Elasticsearch and can potentially take a long time to finish, so these could be limited through this answer as a "first defense".

    Examples: highlighting, scripts, search_analyzers, etc.
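    A minimal Jackson-based sketch of such a "first defense" check is below. The set of keys treated as expensive is just an illustrative starting point, not an official list from Elasticsearch, and it may produce false positives if a document field happens to share one of these names.

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.util.Iterator;
    import java.util.Set;

    public class ExpensiveQueryDetector {

        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Keys we choose to treat as expensive; extend or trim this list as needed.
        private static final Set<String> EXPENSIVE_KEYS = Set.of(
                "script", "script_score", "highlight", "wildcard", "regexp", "fuzzy");

        public static boolean isExpensive(String requestBody) throws Exception {
            return containsExpensiveKey(MAPPER.readTree(requestBody));
        }

        // Walk the JSON tree recursively and flag any known-expensive key.
        private static boolean containsExpensiveKey(JsonNode node) {
            if (node.isObject()) {
                for (Iterator<String> it = node.fieldNames(); it.hasNext(); ) {
                    String field = it.next();
                    if (EXPENSIVE_KEYS.contains(field) || containsExpensiveKey(node.get(field))) {
                        return true;
                    }
                }
            } else if (node.isArray()) {
                for (JsonNode child : node) {
                    if (containsExpensiveKey(child)) {
                        return true;
                    }
                }
            }
            return false;
        }
    }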

    So although this answer is the most naive, it could be a fast win while you work on a long term solution that requires analytics.