Tags: java, elasticsearch, memory-leaks, garbage-collection

Search thread_pool for particular nodes always at maximum


I have an Elasticsearch cluster with 6 nodes. The heap size is set to 50 GB (I know less than 32 GB is what is recommended, but it was already set to 50 GB for some reason I don't know). Now I am seeing a lot of rejections from the search thread_pool.

This is my current search thread_pool:

node_name               name   active rejected  completed
1105-IDC.node          search      0 19295154 1741362188
1108-IDC.node          search      0  3362344 1660241184
1103-IDC.node          search     49 28763055 1695435484
1102-IDC.node          search      0  7715608 1734602881
1106-IDC.node          search      0 14484381 1840694326
1107-IDC.node          search     49 22470219 1641504395
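
For reference, the columns above match the output of the cat thread pool API; a minimal sketch to fetch the same columns, assuming the cluster is reachable on localhost:9200 without authentication:

import requests  # assumption: unauthenticated cluster on localhost:9200

# Pull the same columns as the table above from the cat thread pool API.
resp = requests.get(
    "http://localhost:9200/_cat/thread_pool/search",
    params={"v": "true", "h": "node_name,name,active,rejected,completed"},
)
print(resp.text)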

Something I have noticed is that two nodes (1103-IDC.node and 1107-IDC.node) always have the maximum number of active threads. Other nodes also have rejections, but these two have the highest counts, and their hardware is similar to the rest. What could be the reason for this? Could it be that they hold particular shards that receive more hits? If so, how do I find those shards?

Also, young-generation GC pauses take more than 70 ms (sometimes around 200 ms) on the nodes where the active thread count is always at the maximum. Below are some lines from the GC log:

[2020-10-27T04:32:14.380+0000][53678][gc             ] GC(6768757) Pause Young (Allocation Failure) 27884M->26366M(51008M) 196.226ms
[2020-10-27T04:32:26.206+0000][53678][gc,start       ] GC(6768758) Pause Young (Allocation Failure)
[2020-10-27T04:32:26.313+0000][53678][gc             ] GC(6768758) Pause Young (Allocation Failure) 27897M->26444M(51008M) 107.850ms
[2020-10-27T04:32:35.466+0000][53678][gc,start       ] GC(6768759) Pause Young (Allocation Failure)
[2020-10-27T04:32:35.574+0000][53678][gc             ] GC(6768759) Pause Young (Allocation Failure) 27975M->26444M(51008M) 108.923ms
[2020-10-27T04:32:40.993+0000][53678][gc,start       ] GC(6768760) Pause Young (Allocation Failure)
[2020-10-27T04:32:41.077+0000][53678][gc             ] GC(6768760) Pause Young (Allocation Failure) 27975M->26427M(51008M) 84.411ms
[2020-10-27T04:32:45.132+0000][53678][gc,start       ] GC(6768761) Pause Young (Allocation Failure)
[2020-10-27T04:32:45.200+0000][53678][gc             ] GC(6768761) Pause Young (Allocation Failure) 27958M->26471M(51008M) 68.105ms
[2020-10-27T04:32:46.984+0000][53678][gc,start       ] GC(6768762) Pause Young (Allocation Failure)
[2020-10-27T04:32:47.046+0000][53678][gc             ] GC(6768762) Pause Young (Allocation Failure) 28001M->26497M(51008M) 62.678ms
[2020-10-27T04:32:56.641+0000][53678][gc,start       ] GC(6768763) Pause Young (Allocation Failure)
[2020-10-27T04:32:56.719+0000][53678][gc             ] GC(6768763) Pause Young (Allocation Failure) 28027M->26484M(51008M) 77.596ms
[2020-10-27T04:33:29.488+0000][53678][gc,start       ] GC(6768764) Pause Young (Allocation Failure)
[2020-10-27T04:33:29.740+0000][53678][gc             ] GC(6768764) Pause Young (Allocation Failure) 28015M->26516M(51008M) 251.447ms
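
The pauses above can be summarized with a short script; a minimal sketch that only assumes the unified JVM GC log format shown here and that the log is saved locally as gc.log:

import re

# Matches pause lines like:
#   "... Pause Young (Allocation Failure) 27884M->26366M(51008M) 196.226ms"
PAUSE_RE = re.compile(r"Pause Young \(Allocation Failure\).*?(\d+(?:\.\d+)?)ms")

pauses = []
with open("gc.log") as f:  # assumption: GC log copied locally as gc.log
    for line in f:
        m = PAUSE_RE.search(line)
        if m:
            pauses.append(float(m.group(1)))

if pauses:
    pauses.sort()
    print(f"count={len(pauses)}  "
          f"avg={sum(pauses) / len(pauses):.1f}ms  "
          f"p99={pauses[int(len(pauses) * 0.99)]:.1f}ms  "
          f"max={pauses[-1]:.1f}ms")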

Solution

  • One important thing to note is that if you got these stats from the Elasticsearch thread pool cat API, they are point-in-time data only; the API does not give you historical views such as the last 1 hour, 6 hours, 1 day, or 1 week.

    The rejected and completed counters are cumulative since the last restart of each node, so by themselves they are not very helpful for figuring out whether some ES nodes are becoming hotspots due to a bad or unbalanced shard configuration.

    So there are two important things to figure out:

    1. Identify the actual hotspot nodes in the cluster by looking at the average active and rejected search requests on the data nodes over a time range (checking just the peak hours is enough); see the sampling sketch after this list.
    2. Once the hotspot nodes are known, look at the shards allocated to them and compare them with the shards on the other nodes. Metrics to check include the number of shards, which shards receive the most traffic, and which shards serve the slowest queries; see the shard-comparison sketch after this list. Most of this has to be pieced together from various ES metrics and APIs, which can be time-consuming and requires a lot of internal ES knowledge.
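
For point 1, the history has to be built up yourself (or by a monitoring tool), because the cat API is point-in-time. A minimal sketch of such sampling, assuming an unauthenticated cluster on localhost:9200 and an arbitrary one-minute interval; it prints how many rejections each node accumulated since the previous sample, which is what actually exposes the hotspots:

import time
import requests

ES = "http://localhost:9200"  # assumption: no auth, default port

def sample():
    """Return {node_name: (active, rejected)} from the cat thread pool API."""
    resp = requests.get(
        f"{ES}/_cat/thread_pool/search",
        params={"h": "node_name,active,rejected", "format": "json"},
    )
    resp.raise_for_status()
    return {row["node_name"]: (int(row["active"]), int(row["rejected"]))
            for row in resp.json()}

previous = sample()
while True:
    time.sleep(60)  # the sampling interval is an arbitrary choice
    current = sample()
    for node, (active, rejected) in sorted(current.items()):
        delta = rejected - previous.get(node, (0, rejected))[1]
        print(f"{node}: active={active} rejected +{delta} in the last 60s")
    previous = current

Plotting these per-node deltas over a day (or letting a monitoring stack collect the same metric) makes the hotspot nodes obvious during peak hours.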
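
For point 2, the cat shards API and the index stats API at shard level give the raw material for the comparison. A sketch under the same assumptions (localhost:9200, no auth); the node name and the top-10 cut are illustrative choices, not ES defaults:

import requests

ES = "http://localhost:9200"  # assumption: no auth, default port
HOT_NODE = "1103-IDC.node"    # one of the suspected hotspot nodes

# Which shard copies live on the hot node, and how big they are.
shards = requests.get(
    f"{ES}/_cat/shards",
    params={"h": "index,shard,prirep,node,docs,store", "format": "json"},
).json()
on_hot_node = [s for s in shards if s["node"] == HOT_NODE]
print(f"{len(on_hot_node)} shard copies on {HOT_NODE}")
for s in on_hot_node:
    print(f"  {s['index']}[{s['shard']}] {s['prirep']} docs={s['docs']} store={s['store']}")

# Shard-level search stats: query counts and time spent, to spot busy or slow shards.
stats = requests.get(f"{ES}/_stats/search", params={"level": "shards"}).json()
rows = []
for index, data in stats["indices"].items():
    for shard_id, copies in data["shards"].items():
        for copy in copies:
            s = copy["search"]
            rows.append((s["query_total"], s["query_time_in_millis"], index, shard_id))

print("busiest shards by query count:")
for total, time_ms, index, shard_id in sorted(rows, reverse=True)[:10]:
    print(f"  {index}[{shard_id}] queries={total} query_time_ms={time_ms}")

Cross-referencing the busiest shards against the list of shards sitting on the hotspot nodes shows whether a few heavily queried shards are concentrated on those two nodes.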