
Communicating Infinispan Remote Exceptions Generates Excessive Network Traffic


When an exception occurs in our Infinispan cluster (version 9.4.8.Final), the node where the exception occurred sends it to the other nodes in the cluster. This appears to be by design.

This activity can generate so much traffic that it causes timeout exceptions, which in turn make the nodes communicate their timeout exceptions to the other nodes. In production, our three-node Infinispan cluster completely saturated a 20 Gb/s link.

For example, in a two-node QA cluster we observe the following:

Node 1:

ISPN000476: Timed out waiting for responses for request 7861 from node2

Node 2:

ISPN000217: Received exception from node1, see cause for remote stack trace

Further down the stack trace printed on node 2, we see:

Timed out waiting for responses for request 7861 from node2

There are a large number of these. We took a packet capture during the event and can see 50 KB packets containing lists of the remote errors along with their entire Java stack traces.

When this happens, it is a "perfect storm" of sorts: every timeout produces an error that gets sent over the network, which increases congestion, which produces more timeouts. From there things worsen extremely quickly.

I understand that I need to troubleshoot the timeout issue itself (looking for GC pauses, etc.) and perhaps consider increasing the timeout. However, I'd like to know if there is a way to stop this behavior from happening when these events occur. When you think about it, it seems odd for node 1 to time out talking to node 2 and then send node 2 a copy of the error over the network telling it 'I timed out talking to you'.
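For reference, raising the replication timeout would presumably be a one-attribute change in standalone-ha.xml. The sketch below assumes the Wildfly Infinispan subsystem's `remote-timeout` attribute on clustered caches (value in milliseconds); the cache name and value are illustrative only:

```xml
<!-- Sketch only: raising the replication timeout on a clustered cache.
     "sessions" and 30000 ms are illustrative; attribute name per the
     Wildfly Infinispan subsystem schema. -->
<distributed-cache name="sessions" remote-timeout="30000"/>
```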

Is there any way to avoid the transmission of these remote stack traces? Many thanks in advance for any insight or advice.

EDIT

Example stack trace:

2019-12-06 11:37:01,587 ERROR [org.keycloak.services.error.KeycloakErrorHandler] (default task-26) Uncaught server error: org.infinispan.remoting.RemoteException: ISPN000217: Received exception from ********, see cause for remote stack trace
        at org.infinispan.remoting.transport.ResponseCollectors.wrapRemoteException(ResponseCollectors.java:28)
        at org.infinispan.remoting.transport.ValidSingleResponseCollector.withException(ValidSingleResponseCollector.java:37)
        at org.infinispan.remoting.transport.ValidSingleResponseCollector.addResponse(ValidSingleResponseCollector.java:21)
        at org.infinispan.remoting.transport.impl.SingleTargetRequest.receiveResponse(SingleTargetRequest.java:52)
        at org.infinispan.remoting.transport.impl.SingleTargetRequest.onResponse(SingleTargetRequest.java:35)
        at org.infinispan.remoting.transport.impl.RequestRepository.addResponse(RequestRepository.java:52)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1372)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1275)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.access$300(JGroupsTransport.java:126)
        at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1420)
        at org.jgroups.JChannel.up(JChannel.java:816)
        at org.jgroups.fork.ForkProtocolStack.up(ForkProtocolStack.java:133)
        at org.jgroups.stack.Protocol.up(Protocol.java:340)
        at org.jgroups.protocols.FORK.up(FORK.java:141)
        at org.jgroups.protocols.FRAG3.up(FRAG3.java:171)
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:339)
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:339)
        at org.jgroups.protocols.pbcast.GMS.up(GMS.java:872)
        at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:240)
        at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1008)
        at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:734)
        at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:389)
        at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:590)
        at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:131)
        at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:203)
        at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:253)
        at org.jgroups.protocols.MERGE3.up(MERGE3.java:280)
        at org.jgroups.protocols.Discovery.up(Discovery.java:295)
        at org.jgroups.protocols.TP.passMessageUp(TP.java:1249)
        at org.jgroups.util.SubmitToThreadPool$SingleMessageHandler.run(SubmitToThreadPool.java:87)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at org.jboss.as.clustering.jgroups.ClassLoaderThreadFactory.lambda$newThread$0(ClassLoaderThreadFactory.java:52)
        at java.lang.Thread.run(Thread.java:745)
        Suppressed: org.infinispan.util.logging.TraceException
                at org.infinispan.interceptors.impl.SimpleAsyncInvocationStage.get(SimpleAsyncInvocationStage.java:41)
                at org.infinispan.interceptors.impl.AsyncInterceptorChainImpl.invoke(AsyncInterceptorChainImpl.java:250)
                at org.infinispan.cache.impl.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:1918)
                at org.infinispan.cache.impl.CacheImpl.put(CacheImpl.java:1433)
                at org.infinispan.cache.impl.DecoratedCache.put(DecoratedCache.java:685)
                at org.infinispan.cache.impl.DecoratedCache.put(DecoratedCache.java:240)
                at org.infinispan.cache.impl.AbstractDelegatingCache.put(AbstractDelegatingCache.java:116)
                at org.infinispan.cache.impl.AbstractDelegatingCache.put(AbstractDelegatingCache.java:116)
                at org.infinispan.cache.impl.EncoderCache.put(EncoderCache.java:195)
                at org.infinispan.cache.impl.AbstractDelegatingCache.put(AbstractDelegatingCache.java:116)
                at org.keycloak.cluster.infinispan.InfinispanNotificationsManager.notify(InfinispanNotificationsManager.java:155)
                at org.keycloak.cluster.infinispan.InfinispanClusterProvider.notify(InfinispanClusterProvider.java:130)
                at org.keycloak.models.cache.infinispan.CacheManager.sendInvalidationEvents(CacheManager.java:206)
                at org.keycloak.models.cache.infinispan.UserCacheSession.runInvalidations(UserCacheSession.java:140)
                at org.keycloak.models.cache.infinispan.UserCacheSession$1.commit(UserCacheSession.java:152)
                at org.keycloak.services.DefaultKeycloakTransactionManager.commit(DefaultKeycloakTransactionManager.java:146)
                at org.keycloak.services.resources.admin.UsersResource.createUser(UsersResource.java:125)
                at sun.reflect.GeneratedMethodAccessor487.invoke(Unknown Source)
                at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                at java.lang.reflect.Method.invoke(Method.java:498)
                at org.jboss.resteasy.core.MethodInjectorImpl.invoke(MethodInjectorImpl.java:139)
                at org.jboss.resteasy.core.ResourceMethodInvoker.internalInvokeOnTarget(ResourceMethodInvoker.java:510)
                at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTargetAfterFilter(ResourceMethodInvoker.java:400)
                at org.jboss.resteasy.core.ResourceMethodInvoker.lambda$invokeOnTarget$0(ResourceMethodInvoker.java:364)
                at org.jboss.resteasy.core.interception.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:355)
                at org.jboss.resteasy.core.ResourceMethodInvoker.invokeOnTarget(ResourceMethodInvoker.java:366)
                at org.jboss.resteasy.core.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:338)
                at org.jboss.resteasy.core.ResourceLocatorInvoker.invokeOnTargetObject(ResourceLocatorInvoker.java:137)
                at org.jboss.resteasy.core.ResourceLocatorInvoker.invoke(ResourceLocatorInvoker.java:106)
                at org.jboss.resteasy.core.ResourceLocatorInvoker.invokeOnTargetObject(ResourceLocatorInvoker.java:132)
                at org.jboss.resteasy.core.ResourceLocatorInvoker.invoke(ResourceLocatorInvoker.java:106)
                at org.jboss.resteasy.core.ResourceLocatorInvoker.invokeOnTargetObject(ResourceLocatorInvoker.java:132)
                at org.jboss.resteasy.core.ResourceLocatorInvoker.invoke(ResourceLocatorInvoker.java:100)
                at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:439)
                at org.jboss.resteasy.core.SynchronousDispatcher.lambda$invoke$4(SynchronousDispatcher.java:229)
                at org.jboss.resteasy.core.SynchronousDispatcher.lambda$preprocess$0(SynchronousDispatcher.java:135)
                at org.jboss.resteasy.core.interception.PreMatchContainerRequestContext.filter(PreMatchContainerRequestContext.java:355)
                at org.jboss.resteasy.core.SynchronousDispatcher.preprocess(SynchronousDispatcher.java:138)
                at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:215)
                at org.jboss.resteasy.plugins.server.servlet.ServletContainerDispatcher.service(ServletContainerDispatcher.java:227)
                at org.jboss.resteasy.plugins.server.servlet.HttpServletDispatcher.service(HttpServletDispatcher.java:56)
                at org.jboss.resteasy.plugins.server.servlet.HttpServletDispatcher.service(HttpServletDispatcher.java:51)
                at javax.servlet.http.HttpServlet.service(HttpServlet.java:791)
                at io.undertow.servlet.handlers.ServletHandler.handleRequest(ServletHandler.java:74)
                at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:129)
                at org.keycloak.services.filters.KeycloakSessionServletFilter.doFilter(KeycloakSessionServletFilter.java:90)
                at io.undertow.servlet.core.ManagedFilter.doFilter(ManagedFilter.java:61)
                at io.undertow.servlet.handlers.FilterHandler$FilterChainImpl.doFilter(FilterHandler.java:131)
                at io.undertow.servlet.handlers.FilterHandler.handleRequest(FilterHandler.java:84)
                at io.undertow.servlet.handlers.security.ServletSecurityRoleHandler.handleRequest(ServletSecurityRoleHandler.java:62)
                at io.undertow.servlet.handlers.ServletChain$1.handleRequest(ServletChain.java:68)
                at io.undertow.servlet.handlers.ServletDispatchingHandler.handleRequest(ServletDispatchingHandler.java:36)
                at org.wildfly.extension.undertow.security.SecurityContextAssociationHandler.handleRequest(SecurityContextAssociationHandler.java:78)
                at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
                at io.undertow.servlet.handlers.security.SSLInformationAssociationHandler.handleRequest(SSLInformationAssociationHandler.java:132)
                at io.undertow.servlet.handlers.security.ServletAuthenticationCallHandler.handleRequest(ServletAuthenticationCallHandler.java:57)
                at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
                at io.undertow.security.handlers.AbstractConfidentialityHandler.handleRequest(AbstractConfidentialityHandler.java:46)
                at io.undertow.servlet.handlers.security.ServletConfidentialityConstraintHandler.handleRequest(ServletConfidentialityConstraintHandler.java:64)
                at io.undertow.security.handlers.AuthenticationMechanismsHandler.handleRequest(AuthenticationMechanismsHandler.java:60)
                at io.undertow.servlet.handlers.security.CachedAuthenticatedSessionHandler.handleRequest(CachedAuthenticatedSessionHandler.java:77)
                at io.undertow.security.handlers.NotificationReceiverHandler.handleRequest(NotificationReceiverHandler.java:50)
                at io.undertow.security.handlers.AbstractSecurityContextAssociationHandler.handleRequest(AbstractSecurityContextAssociationHandler.java:43)
                at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
                at org.wildfly.extension.undertow.security.jacc.JACCContextIdHandler.handleRequest(JACCContextIdHandler.java:61)
                at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
                at org.wildfly.extension.undertow.deployment.GlobalRequestControllerHandler.handleRequest(GlobalRequestControllerHandler.java:68)
                at io.undertow.server.handlers.PredicateHandler.handleRequest(PredicateHandler.java:43)
                at io.undertow.servlet.handlers.ServletInitialHandler.handleFirstRequest(ServletInitialHandler.java:292)
                at io.undertow.servlet.handlers.ServletInitialHandler.access$100(ServletInitialHandler.java:81)
                at io.undertow.servlet.handlers.ServletInitialHandler$2.call(ServletInitialHandler.java:138)
                at io.undertow.servlet.handlers.ServletInitialHandler$2.call(ServletInitialHandler.java:135)
                at io.undertow.servlet.core.ServletRequestContextThreadSetupAction$1.call(ServletRequestContextThreadSetupAction.java:48)
                at io.undertow.servlet.core.ContextClassLoaderSetupAction$1.call(ContextClassLoaderSetupAction.java:43)
                at org.wildfly.extension.undertow.security.SecurityContextThreadSetupAction.lambda$create$0(SecurityContextThreadSetupAction.java:105)
                at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1502)
                at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1502)
                at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1502)
                at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1502)
                at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1502)
                at org.wildfly.extension.undertow.deployment.UndertowDeploymentInfoService$UndertowThreadSetupAction.lambda$create$0(UndertowDeploymentInfoService.java:1502)
                at io.undertow.servlet.handlers.ServletInitialHandler.dispatchRequest(ServletInitialHandler.java:272)
                at io.undertow.servlet.handlers.ServletInitialHandler.access$000(ServletInitialHandler.java:81)
                at io.undertow.servlet.handlers.ServletInitialHandler$1.handleRequest(ServletInitialHandler.java:104)
                at io.undertow.server.Connectors.executeRootHandler(Connectors.java:364)
                at io.undertow.server.HttpServerExchange$1.run(HttpServerExchange.java:830)
                at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
                at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:1982)
                at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1486)
                at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1363)
                ... 1 more
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 7865 from ********
        at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
        at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
        at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        ... 1 more

Solution

  • We were able to work around the issue that caused the massive spike in network traffic. Details are below.

    tl;dr:

We switched from the JGroups UDP stack to the TCP stack, which the ISPN docs note can be more efficient for small clusters using distributed caches like ours.

    Reproducing the Issue

    To reproduce the issue, we did the following:

    • Configure the Keycloak background job that purges expired ISPN cache entries to run every 60 seconds so we don't have to wait for its usual 15-minute interval (standalone-ha.xml):

      <scheduled-task-interval>60</scheduled-task-interval>

    • Generate a large volume of user sessions (we used jmeter). In our tests, we ended up generating about 100,000 sessions.

    • Configure the SSO Idle and Max TTLs in Keycloak to be extremely short (60 seconds) so that all the sessions expire
    • Continue to place load on the system with jmeter (this part is important)
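The SSO timeouts in the steps above can also be set from the Keycloak admin CLI instead of the console. This is a sketch assuming a realm named "example" and local admin credentials; `ssoSessionIdleTimeout` and `ssoSessionMaxLifespan` are the realm representation fields (in seconds):

```shell
# Authenticate the admin CLI (server URL, realm name, and credentials
# are placeholders for your environment).
./kcadm.sh config credentials --server http://localhost:8080/auth \
    --realm master --user admin --password admin

# Shrink the SSO idle and max lifespans to 60 seconds so sessions expire fast.
./kcadm.sh update realms/example \
    -s ssoSessionIdleTimeout=60 \
    -s ssoSessionMaxLifespan=60
```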

    When the purge job ran, we would see the network flood with traffic (120 MB/s) and tons of the following errors on each node in the cluster:

    ISPN000476: Timed out waiting for responses for request 7861 from node2

    ISPN000217: Received exception from node1, see cause for remote stack trace

    Pro-Tip: Configure passivation to a file store to persist your ISPN data. Shut down your cluster and copy the resulting ".dat" files elsewhere; restoring them lets you instantly recreate the state of your ISPN cluster between tests.
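A file store with passivation is a per-cache setting in the Wildfly Infinispan subsystem. A minimal sketch, assuming the `file-store` element and a hypothetical cache name and path (the ".dat" files land under the node's data directory):

```xml
<!-- Sketch only: persist evicted entries to disk so cluster state
     survives a shutdown. Cache name and path are illustrative. -->
<distributed-cache name="sessions">
    <file-store passivation="true" purge="false" path="ispn-snapshots"/>
</distributed-cache>
```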

    Fixing the Issue

    Using the technique above, we were able to reproduce the issue on demand. As such, we set out to resolve it using the approaches described below.

    Change JGroups Stack to Use TCP

    We changed the JGroups stack from UDP to TCP and also configured TCPPing for discovery. We did this after reading the description for the TCP stack in the following guide:

    https://infinispan.org/docs/stable/titles/configuring/configuring.html#preconfigured_jgroups_stacks-configuring

    Specifically:

    "Uses TCP for transport and UDP multicast for discovery. Suitable for smaller clusters (under 100 nodes) only if you are using distributed caches because TCP is more efficient than UDP as a point-to-point protocol."

    This one change completely eliminated our issue.

    Our Wildfly 16 config in standalone-ha.xml is as follows:

    <subsystem xmlns="urn:jboss:domain:jgroups:6.0">
        <channels default="ee">
            <channel name="ee" stack="tcp" cluster="ejb"/>
        </channels>
        <stacks>
            <stack name="udp">
                <transport type="UDP" socket-binding="jgroups-udp"/>
                <protocol type="PING"/>
                <protocol type="MERGE3"/>
                <protocol type="FD_SOCK"/>
                <protocol type="FD_ALL"/>
                <protocol type="VERIFY_SUSPECT"/>
                <protocol type="pbcast.NAKACK2"/>
                <protocol type="UNICAST3"/>
                <protocol type="pbcast.STABLE"/>
                <protocol type="pbcast.GMS"/>
                <protocol type="UFC"/>
                <protocol type="MFC"/>
                <protocol type="FRAG3"/>
            </stack>
            <stack name="tcp">
                <transport type="TCP" socket-binding="jgroups-tcp"/>
                <socket-protocol type="TCPPING" socket-binding="jgroups-tcp">
                    <property name="initial_hosts">HOST-X[7600],HOST-Y[7600],HOST-Z[7600]</property>
                    <property name="port_range">1</property>
                </socket-protocol>
                <protocol type="MERGE3"/>
                <protocol type="FD_SOCK"/>
                <protocol type="FD_ALL"/>
                <protocol type="VERIFY_SUSPECT"/>
                <protocol type="pbcast.NAKACK2"/>
                <protocol type="UNICAST3"/>
                <protocol type="pbcast.STABLE"/>
                <protocol type="pbcast.GMS"/>
                <protocol type="MFC"/>
                <protocol type="FRAG3"/>
            </stack>
        </stacks>
    </subsystem>
    

    Tune JVM Garbage Collector Arguments

    We followed some of the advice in the ISPN tuning guide:

    https://infinispan.org/docs/stable/titles/tuning/tuning.html

    In particular, we changed from JDK 8's default GC to the CMS collector. Specifically, our Wildfly server now runs with the following JVM arguments:

    -Xms6144m -Xmx6144m -Xmn1536M -XX:MetaspaceSize=192M -XX:MaxMetaspaceSize=512m -Djava.net.preferIPv4Stack=true -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+DisableExplicitGC
    

    Misc Changes

    We applied other changes particular to our environment:

    • Blocked IP multicast and UDP at the iptables level (since we wanted to be sure we were TCP-only)
    • Configured a bandwidth cap at the network level to prevent the ISPN cluster from saturating the link and impacting other hosts that share it.
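A host-level sketch of the two changes above, assuming `eth0` as the cluster interface, Wildfly's default `jgroups-udp` port of 55200, and a 5 Gbit/s cap (all values are placeholders for your environment):

```shell
# Drop multicast and JGroups UDP traffic so only TCP is used.
iptables -A INPUT -m pkttype --pkt-type multicast -j DROP
iptables -A INPUT -p udp --dport 55200 -j DROP   # default jgroups-udp binding

# Cap outbound bandwidth on the cluster interface with a token bucket filter.
tc qdisc add dev eth0 root tbf rate 5gbit burst 32mb latency 400ms
```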