Something inside Elasticsearch 7.4 cluster is getting slower and slower with read timeouts now and then

Regularly the past days our ES 7.4 cluster (4 nodes) is giving read timeouts and is getting slower and slower when it comes to running certain management commands. Before that it has been running for more than a year without any trouble. For instance /_cat/nodes was taking 2 minutes yesterday to execute, today it is already taking 4 minutes. Server loads are low, memory usage seems fine, not sure where to look further.

Using the opster.com online tool I managed to get some hint that the management queue size is high, however when executing the suggested commands there to investigate I don't see anything out of the ordinary other than that the command takes long to give a result:

$ curl "http://127.0.0.1:9201/_cat/thread_pool/management?v&h=id,active,rejected,completed,node_id"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   345  100   345    0     0      2      0  0:02:52  0:02:47  0:00:05    90
id                     active rejected completed node_id
JZHgYyCKRyiMESiaGlkITA      1        0   4424211 elastic7-1
jllZ8mmTRQmsh8Sxm8eDYg      1        0   4626296 elastic7-4
cI-cn4V3RP65qvE3ZR8MXQ      5        0   4666917 elastic7-2
TJJ_eHLIRk6qKq_qRWmd3w      1        0   4592766 elastic7-3

How can I debug this / solve this? Thanks in advance.

Solution

If you notice your elastic7-2 node is having 5 active requests in the management queue, which is really high, As the management queue capacity itself is just 5, and it's used only for very few operations(Management, not search/index).

You can have a look at threadpools in elasticsearch for further read.