Tags: elasticsearch, elastic-stack, elasticsearch-plugin, elasticsearch-aggregation

Elasticsearch: QueueResizingEsThreadPoolExecutor exception


At some point during a 35k endurance load test of my Java web app, which fetches static data from Elasticsearch, I started getting the following Elasticsearch exception:

Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@1a25fe82 on QueueResizingEsThreadPoolExecutor[name = search4/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 10.7ms, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@6312a0bb[Running, pool size = 25, active threads = 25, queued tasks = 1000, completed tasks = 34575035]]

Elasticsearch details:

Elasticsearch version 6.2.4.

The cluster consists of 5 nodes. The JVM heap size setting for each node is Xms16g and Xmx16g. Each node machine has 16 processors.

NOTE: when I first hit this exception, I decided to increase the thread_pool.search.queue_size parameter in elasticsearch.yml, setting it to 10000. Yes, I understand this just postpones the problem.
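For reference, thread_pool.search.queue_size is a static node-level setting in 6.x, so it lives in elasticsearch.yml and only takes effect after a node restart:

```yaml
# elasticsearch.yml (ES 6.x) -- static setting, requires a (rolling) node restart
thread_pool.search.queue_size: 10000
```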

Elasticsearch indices details: currently there are about 20 indices, and only 6 of them are in use. The unused ones are old indices that were not deleted after newer ones were created. The indices themselves are really small: (screenshot of the index list)

The index within the red rectangle is the one used by my web app. Its shard and replica settings are "number_of_shards": "5" and "number_of_replicas": "2" respectively. Its shard details: (screenshot of the shard list) In this article I found that

Small shards result in small segments, which increases overhead. Aim to keep the average shard size between at least a few GB and a few tens of GB. For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size.

As you can see from the screenshot above, my shard size is much smaller than the recommended size. Hence, Q: what is the right number of shards in my case? Is it 1 or 2? The index won't grow much over time.
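Per-shard store sizes can also be double-checked with the _cat API (the index name here is a placeholder):

```console
GET _cat/shards/products?v&h=index,shard,prirep,store
```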

ES queries issued during the test: the load test simulates a scenario where a user navigates to a page for searching products. The user can filter the products using corresponding filters (e.g. name, city, etc.). The unique filter values are fetched from the ES index using a composite aggregation; this is the first query type. The other query fetches the products themselves. It consists of must, must_not, filter, and has_child clauses, with the size attribute equal to 100.
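As a rough sketch of the two query types (the index name products, the child type offer, and field names like city are assumptions, not taken from the actual mappings):

```console
# Query type 1: distinct filter values via a composite aggregation
POST products/_search
{
  "size": 0,
  "aggs": {
    "unique_cities": {
      "composite": {
        "size": 100,
        "sources": [
          { "city": { "terms": { "field": "city.keyword" } } }
        ]
      }
    }
  }
}

# Query type 2: product search combining must / must_not / filter / has_child
POST products/_search
{
  "size": 100,
  "query": {
    "bool": {
      "must":     [ { "match": { "name": "laptop" } } ],
      "must_not": [ { "term": { "discontinued": true } } ],
      "filter": [
        { "term": { "city": "Berlin" } },
        { "has_child": {
            "type": "offer",
            "query": { "range": { "price": { "lte": 1000 } } }
        } }
      ]
    }
  }
}
```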

I set up slow search logging, but nothing was logged:

    "index": {
        "search": {
            "slowlog": {
                "level": "info",
                "threshold": {
                    "fetch": {
                        "debug": "500ms",
                        "info": "500ms"
                    },
                    "query": {
                        "debug": "2s",
                        "info": "1s"
                    }
                }
            }
        }
    }

I feel like I am missing something simple that would finally let the cluster handle my load. I'd appreciate any help with solving this issue.


Solution

  • For such a small index you are using 5 primary shards, which I suspect is because the default in your ES version 6.x was 5 and you never changed it. In short, a high number of primary shards for a small index carries a severe performance penalty; please refer to a very similar use-case (I was also having 5 primary shards 😀) which I covered in my blog.

    As you already mentioned that your index size will not grow significantly in the future, I would suggest having 1 primary shard and 4 replica shards:

    1. One primary shard means that a single search creates only one thread and one request in Elasticsearch, which provides better utilisation of resources.
    2. As you have 5 data nodes, having 4 replicas means the shards are distributed across every data node, so your throughput and performance will be optimal.

    After this change, measure the performance; I am sure you can then reduce the search queue size back to 1000, since, as you noted yourself, a high queue size just delays the problem and does not address the issue at hand.
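    A minimal sketch of that shard-count change (index names products / products_v1 are placeholders; note that _reindex does not copy settings or mappings, so define them on the new index first):

```console
# 1. Create the new index with 1 primary shard and 4 replicas
PUT products_v1
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 4
  }
}

# 2. Copy the documents over
POST _reindex
{
  "source": { "index": "products" },
  "dest":   { "index": "products_v1" }
}
```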

    Coming to your search slow log: I feel your thresholds are very high. For the query phase, 1 second is really high for a user-facing application; try lowering it to ~100ms, note down the queries that exceed it, and optimise them further.
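    Lowering the threshold could look like this (index name assumed; slow log thresholds are dynamic index settings, so no restart is needed):

```console
PUT products/_settings
{
  "index.search.slowlog.threshold.query.info": "100ms",
  "index.search.slowlog.threshold.fetch.info": "100ms"
}
```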