We have been having issues where some pods for a service run into problems with the HTTP client connection pool. The number of connections in the pool suddenly starts growing, and we end up getting errors when trying to acquire a new connection. It is very easy to see on a graph of used connections in the pool: the count suddenly starts to grow and becomes a lot more volatile.
The error thrown is:
Connect Error: Too many outstanding acquire operations
at io.micronaut.http.client.netty.ConnectionManager.lambda$connectForExchange$6(ConnectionManager.java:480)
at io.micronaut.http.client.netty.ConnectionManager.lambda$addInstrumentedListener$16(ConnectionManager.java:907)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:557)
at io.netty.util.concurrent.DefaultPromise.access$200(DefaultPromise.java:35)
at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:503)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.PromiseTask.run(PromiseTask.java:106)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [1 skipped]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.IllegalStateException: Too many outstanding acquire operations
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [4 skipped]
at io.netty.util.concurrent.PromiseTask.run(PromiseTask.java:106)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [1 skipped]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:833)
... 4 common frames omitted
When this happens, the only way to recover is to restart the pods. The errors then go away and the graph of used connections in the pool becomes stable again.
This has happened to more than one of our services, but the one I am looking at now integrates with an external API that does not always behave perfectly: there are read timeouts and various 5XX responses at times. We also suspect that it occasionally scales down and that some IPs associated with its DNS name disappear, although that is just a theory. Because of that we have looked at Java DNS caching, and we are at the default, which should be a 30-second cache for successful lookups.
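A quick way to double-check what the JVM is actually using on the affected pods is to dump the standard JDK properties; this is just an illustration (the class name is made up, the property names are the standard ones):

import java.security.Security;

public class DnsCacheCheck {
    public static void main(String[] args) {
        // Seconds that successful lookups are cached. null means the JDK default
        // applies, which is 30 seconds when no security manager is installed.
        System.out.println("networkaddress.cache.ttl = "
                + Security.getProperty("networkaddress.cache.ttl"));
        // Seconds that failed lookups are cached (JDK default is 10).
        System.out.println("networkaddress.cache.negative.ttl = "
                + Security.getProperty("networkaddress.cache.negative.ttl"));
        // Legacy system property that can also override the positive TTL.
        System.out.println("sun.net.inetaddr.ttl = "
                + System.getProperty("sun.net.inetaddr.ttl"));
    }
}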
At other times the problem is preceded by a request spike.
This is the part of the service's Micronaut configuration that seems relevant:
micronaut.http.client.read-timeout=PT6S
micronaut.http.client.read-idle-timeout=PT6S
micronaut.http.client.connect-timeout=PT2S
micronaut.http.client.connect-ttl=PT60S
micronaut.http.client.pool.enabled=true
micronaut.http.client.pool.max-connections=50
micronaut.http.client.pool.max-pending-acquires=50
micronaut.http.client.pool.acquire-timeout=PT10S
micronaut.http.client.connection-pool-idle-timeout=60s
The Micronaut version is 3.8.4.
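The settings above sit under the shared micronaut.http.client.* namespace, so as far as I understand they apply to every client that does not have its own service configuration. For context, the client towards the external API is a declarative one, roughly like this (names are placeholders); if I read the docs correctly, the pool and timeout settings could also be scoped per service under micronaut.http.services.<id>.* instead:

import io.micronaut.http.annotation.Get;
import io.micronaut.http.client.annotation.Client;

// "external-api" is a placeholder service id. With an id, configuration can be
// scoped under micronaut.http.services.external-api.* (url, timeouts, pool.*)
// instead of the global micronaut.http.client.* block.
@Client(id = "external-api")
public interface ExternalApiClient {

    // Placeholder endpoint on the external API.
    @Get("/health")
    String health();
}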
Our feeling is that something changes so that some connections in the pool stop working, new ones get allocated, and the broken ones are never reaped/killed. The pool fills up and can never really recover without a restart.
What I am looking for is ideas on how to go about finding the problem. We have found very little information on how to inspect the connection pool and see what is happening in it. Are there configuration problems or missing configuration above? Or is any similar problem known in Micronaut or Netty?
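So far the best idea we have for visibility is turning up logging for the client and for Netty's pool classes; I am not sure how much pool state ConnectionManager actually logs at these levels in 3.8.x, so this is just a sketch (assuming Logback is the SLF4J backend):

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class PoolDebugLogging {

    // Raise log levels at runtime; the same thing can of course go in logback.xml.
    public static void enable() {
        // Micronaut's Netty client classes, including ConnectionManager.
        ((Logger) LoggerFactory.getLogger("io.micronaut.http.client")).setLevel(Level.TRACE);
        // Netty's channel pool implementation.
        ((Logger) LoggerFactory.getLogger("io.netty.channel.pool")).setLevel(Level.DEBUG);
    }
}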
I have not filed a bug in the Micronaut project on GitHub since I do not have a clear way of reproducing it.
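For what it is worth, the "Too many outstanding acquire operations" message itself appears to come from Netty's FixedChannelPool, which I believe the 3.x client pool builds on. The exception alone is easy to trigger in isolation once acquires pile up past max-connections plus max-pending-acquires without any releases, although that is of course not the same as reproducing the production leak (host and port below are placeholders):

import io.netty.bootstrap.Bootstrap;
import io.netty.channel.Channel;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.pool.AbstractChannelPoolHandler;
import io.netty.channel.pool.FixedChannelPool;
import io.netty.channel.socket.nio.NioSocketChannel;
import io.netty.util.concurrent.Future;

public class AcquireStormSketch {
    public static void main(String[] args) throws InterruptedException {
        NioEventLoopGroup group = new NioEventLoopGroup(1);
        Bootstrap bootstrap = new Bootstrap()
                .group(group)
                .channel(NioSocketChannel.class)
                // Placeholder endpoint: anything that accepts TCP connections.
                .remoteAddress("localhost", 8080);

        // Same limits as our config: 50 connections, 50 pending acquires.
        FixedChannelPool pool = new FixedChannelPool(bootstrap,
                new AbstractChannelPoolHandler() {
                    @Override
                    public void channelCreated(Channel ch) {
                        // no-op; a real client would add handlers here
                    }
                }, 50, 50);

        // Acquire without ever releasing: the first 50 acquires take connections,
        // the next 50 queue as pending, and everything after that fails with
        // IllegalStateException: Too many outstanding acquire operations.
        for (int i = 0; i < 120; i++) {
            Future<Channel> acquire = pool.acquire();
            acquire.addListener(f -> {
                if (!f.isSuccess()) {
                    System.out.println("acquire failed: " + f.cause());
                }
            });
        }

        Thread.sleep(3000);
        group.shutdownGracefully();
    }
}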
Update: the Micronaut team found this bug; see this issue for more information: https://github.com/micronaut-projects/micronaut-core/issues/9545