I have an Elasticsearch cluster with 6 nodes. The heap size is set to 50 GB (I know less than 32 GB is what is recommended, but it was already set to 50 GB for some reason I don't know). Now I am seeing a lot of rejections from the search thread_pool.
This is my current search thread_pool:
node_name name active rejected completed
1105-IDC.node search 0 19295154 1741362188
1108-IDC.node search 0 3362344 1660241184
1103-IDC.node search 49 28763055 1695435484
1102-IDC.node search 0 7715608 1734602881
1106-IDC.node search 0 14484381 1840694326
1107-IDC.node search 49 22470219 1641504395
Something I have noticed is that two nodes (1103-IDC.node and 1107-IDC.node) always have the maximum number of active threads. Even though the other nodes also have rejections, these two have the highest counts, and their hardware is similar to the rest. What could be the reason for this? Could it be that they hold particular shards that receive more hits? If so, how do I find those shards?
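For context, this is the kind of check I have in mind (a rough sketch only; the http://localhost:9200 endpoint is a placeholder and you may need auth/TLS for your cluster): pull per-shard search counters from the index stats API with level=shards and rank shard copies by search.query_total.

import requests

ES = "http://localhost:9200"  # placeholder endpoint; adjust for your cluster

def hot_shards(top_n=20):
    # level=shards returns one stats entry per shard copy (primaries and replicas)
    resp = requests.get(f"{ES}/_stats/search", params={"level": "shards"})
    resp.raise_for_status()
    rows = []
    for index, idx_stats in resp.json()["indices"].items():
        for shard_num, copies in idx_stats["shards"].items():
            for copy in copies:
                routing = copy.get("routing", {})
                rows.append((
                    copy["search"]["query_total"],           # cumulative since the shard started
                    copy["search"]["query_time_in_millis"],
                    index,
                    shard_num,
                    "p" if routing.get("primary") else "r",
                    routing.get("node"),                     # node id; map to names via GET /_nodes
                ))
    rows.sort(key=lambda r: r[0], reverse=True)
    for query_total, query_time_ms, index, shard, kind, node in rows[:top_n]:
        print(f"{index} shard {shard}{kind} on {node}: "
              f"queries={query_total} query_time_ms={query_time_ms}")

hot_shards()

Since query_total is cumulative from the time the shard started, sampling it twice and diffing would give a better picture of current load than a single snapshot.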
Also, young-generation GC pauses are taking more than 70ms (sometimes around 200ms) on the nodes where the active thread count is always at the maximum. Below are some lines from the GC log:
[2020-10-27T04:32:14.380+0000][53678][gc ] GC(6768757) Pause Young (Allocation Failure) 27884M->26366M(51008M) 196.226ms
[2020-10-27T04:32:26.206+0000][53678][gc,start ] GC(6768758) Pause Young (Allocation Failure)
[2020-10-27T04:32:26.313+0000][53678][gc ] GC(6768758) Pause Young (Allocation Failure) 27897M->26444M(51008M) 107.850ms
[2020-10-27T04:32:35.466+0000][53678][gc,start ] GC(6768759) Pause Young (Allocation Failure)
[2020-10-27T04:32:35.574+0000][53678][gc ] GC(6768759) Pause Young (Allocation Failure) 27975M->26444M(51008M) 108.923ms
[2020-10-27T04:32:40.993+0000][53678][gc,start ] GC(6768760) Pause Young (Allocation Failure)
[2020-10-27T04:32:41.077+0000][53678][gc ] GC(6768760) Pause Young (Allocation Failure) 27975M->26427M(51008M) 84.411ms
[2020-10-27T04:32:45.132+0000][53678][gc,start ] GC(6768761) Pause Young (Allocation Failure)
[2020-10-27T04:32:45.200+0000][53678][gc ] GC(6768761) Pause Young (Allocation Failure) 27958M->26471M(51008M) 68.105ms
[2020-10-27T04:32:46.984+0000][53678][gc,start ] GC(6768762) Pause Young (Allocation Failure)
[2020-10-27T04:32:47.046+0000][53678][gc ] GC(6768762) Pause Young (Allocation Failure) 28001M->26497M(51008M) 62.678ms
[2020-10-27T04:32:56.641+0000][53678][gc,start ] GC(6768763) Pause Young (Allocation Failure)
[2020-10-27T04:32:56.719+0000][53678][gc ] GC(6768763) Pause Young (Allocation Failure) 28027M->26484M(51008M) 77.596ms
[2020-10-27T04:33:29.488+0000][53678][gc,start ] GC(6768764) Pause Young (Allocation Failure)
[2020-10-27T04:33:29.740+0000][53678][gc ] GC(6768764) Pause Young (Allocation Failure) 28015M->26516M(51008M) 251.447ms
One important thing to note: if you got these stats from the Elasticsearch thread_pool cat API, they are point-in-time data only and do not show historical data for, say, the last 1 hour, 6 hours, 1 day, or 1 week.
Also, rejected and completed are cumulative counts since the last restart of each node, so on their own they are not very helpful when we are trying to figure out whether some ES nodes are becoming hot-spots due to a bad/unbalanced shard configuration.
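One workaround (just a sketch; the endpoint and interval are placeholders, and a proper monitoring stack would do this better) is to poll the cat thread_pool API yourself and log the deltas of rejected/completed between samples, which turns the lifetime counters into a per-interval rate per node:

import time
import requests

ES = "http://localhost:9200"  # placeholder endpoint; adjust for your cluster
INTERVAL_S = 60               # sampling interval in seconds

def sample():
    # format=json makes the cat API machine-readable; h= selects the columns
    resp = requests.get(
        f"{ES}/_cat/thread_pool/search",
        params={"format": "json", "h": "node_name,active,queue,rejected,completed"},
    )
    resp.raise_for_status()
    return {row["node_name"]: row for row in resp.json()}

prev = sample()
while True:
    time.sleep(INTERVAL_S)
    cur = sample()
    print(f"--- last {INTERVAL_S}s ---")
    for node, row in cur.items():
        before = prev.get(node)
        if before is None:
            continue  # node was not present in the previous sample
        d_rej = int(row["rejected"]) - int(before["rejected"])
        d_done = int(row["completed"]) - int(before["completed"])
        # negative deltas mean the node restarted between samples
        print(f"{node}: active={row['active']} queue={row['queue']} "
              f"rejected +{d_rej} completed +{d_done}")
    prev = cur

Graphing those deltas over time makes it much easier to see whether the same two nodes are consistently doing more search work than the rest of the cluster.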
So here we have two very important things to figure out: