java go elasticsearch timeout elasticsearch-dsl

Why there is much time cost between took_millis and timeout in elasticsearch slow query

I meet some slow queries in the production environment and I config the slow log to find some slow queries info like this(query_string with 500ms timeout):

[2021-06-21T10:43:33,930][DEBUG][index.search.slowlog.query.xxxx] [xxx][g][2] took[1s], took_millis[1043], total_hits[424690], types[top], stats[], search_typ      e[QUERY_THEN_FETCH], total_shards[6], source[{query_string}]

In this case, the query timeout is 500ms, and the took_millis in response is 1043ms. As far as I know the timeout is only useful for the query parse and the took value represents the execution time in es without some external phases like Query timing: ‘took’ value and what I’m measuring. I have two questions:

Firstly, why there is 504ms(1043 - 500 = 504) between the timeout and took_millis?
Secondly, how can I know the detail time cost between the timeout and took_millis time?

Thanks a lot!

Solution

Setting a timeout on a query doesn't ensure that the query is actually cancelled when its execution time surpasses that timeout. The Elasticsearch documentation states:

"By default, a running search only checks if it is cancelled or not on segment boundaries, therefore the cancellation can be delayed by large segments."

https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search.html#global-search-timeout

Check issues 3627, 4586 and 2929

This can explain the 504ms between timeout and took_millis, your query just takes that long, and it is not cancelled in time.

To analyze the query execution and see what might be causing these long delays, you can rerun the query with the profile API. Note that if the slow execution of your query cannot be reproduced, this won't help you solve the issue. If your query runs fine most of the time, try to correlate these slow-running queries with external factors such as server load.