
Solr Admin Connection reset when connecting to Zookeeper


I am in the process of setting up a Solr (8.11.1) + Zookeeper (3.6.2) cluster that has nodes in two different data centers. The eventual aim is fault tolerance: even if one whole data center is offline, the Solr + Zookeeper setup should still work. In our test environment we have a running ZK ensemble with 5 nodes.

4 nodes are in one data center (EU region) and 1 node is in a different data center (North America). For the rest of this discussion we will call this 5th node ZK5. Apart from that, there are 4 Solr nodes (in the EU region). Each Solr node points to the 4 EU ZK nodes in its configuration (ZK_HOST).

With this configuration both Solr and Zookeeper seem to run fine. ZK is able to handle failure of up to 2 nodes. If we bring down a third node then the whole ensemble becomes unresponsive. This is expected as per ZK documentation.
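The failure tolerance described above follows from majority-quorum arithmetic. A minimal sketch of that arithmetic, assuming ZooKeeper's default majority quorum (the function names are illustrative and not part of any ZooKeeper API):

```python
def quorum_size(ensemble_size):
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many nodes can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size, nodes_up):
    """True if the nodes that are still up form a majority."""
    return nodes_up >= quorum_size(ensemble_size)


# Our 5-node ensemble: a majority is 3 nodes, so 2 failures are tolerated.
print(quorum_size(5), tolerated_failures(5))
# Two nodes down: still a majority. Three nodes down: no majority.
print(has_quorum(5, 3), has_quorum(5, 2))
```

By the same arithmetic, a 4 + 1 split across two data centers keeps quorum if the single-node data center is lost, but not if the 4-node data center is lost, since only 1 of 5 nodes would remain.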

However, in this setup the ZK Status page in Solr shows a few warnings saying that Solr is not connected to all the nodes in the ZK ensemble.

If we include the ZK5 node (NA region) in the Solr ZK_HOST config, the ZK Status page shows a connection error and no information about Zookeeper is displayed.

In this case we see an exception in the Solr logs reporting a connection reset:
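For reference, ZK_HOST is a comma-separated list of host:port pairs with an optional chroot suffix. A minimal sketch of splitting such a string into its parts, for example to compare the configured hosts against the full ensemble. The hostnames are hypothetical, the fallback to port 2181 is an assumption, IPv6 literals are not handled, and this is not Solr's own parsing code:

```python
def parse_zk_host(zk_host):
    """Split a ZK_HOST string into ([(host, port), ...], chroot)."""
    # Everything after the first "/" is treated as the chroot path.
    hosts_part, _, chroot = zk_host.partition("/")
    servers = []
    for entry in hosts_part.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.partition(":")
        # Assumption: fall back to the usual client port when none is given.
        servers.append((host, int(port) if port else 2181))
    return servers, ("/" + chroot if chroot else "")


# Hypothetical hostnames, for illustration only.
servers, chroot = parse_zk_host("zk1.example.com:2181,zk5.example.com:2181/solr")
print(servers, chroot)
```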

2022-09-23 13:26:15.850 ERROR (qtp131206411-30) [   ] o.a.s.s.HttpSolrCall java.io.UncheckedIOException: java.net.SocketException: Connection reset => java.io.UncheckedIOException: java.net.SocketException: Connection reset

    at java.io.BufferedReader$1.hasNext(BufferedReader.java:574)

java.io.UncheckedIOException: java.net.SocketException: Connection reset

    at java.io.BufferedReader$1.hasNext(BufferedReader.java:574) ~[?:1.8.0_141]

    at java.util.Iterator.forEachRemaining(Iterator.java:115) ~[?:1.8.0_141]

    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) ~[?:1.8.0_141]

    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_141]

    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_141]

    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[?:1.8.0_141]

    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_141]

    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) ~[?:1.8.0_141]

    at org.apache.solr.handler.admin.ZookeeperStatusHandler.getZkRawResponse(ZookeeperStatusHandler.java:302) ~[?:?]

    at org.apache.solr.handler.admin.ZookeeperStatusHandler.monitorZookeeper(ZookeeperStatusHandler.java:254) ~[?:?]

    at org.apache.solr.handler.admin.ZookeeperStatusHandler.getZkStatus(ZookeeperStatusHandler.java:155) ~[?:?]

    at org.apache.solr.handler.admin.ZookeeperStatusHandler.handleRequestBody(ZookeeperStatusHandler.java:95) ~[?:?]

    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:216) ~[?:?]

    at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:836) ~[?:?]

    at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:800) ~[?:?]

    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:545) ~[?:?]

    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427) ~[?:?]

    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:357) ~[?:?]

    at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:201) ~[jetty-servlet-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601) ~[jetty-servlet-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548) ~[jetty-servlet-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:600) ~[jetty-security-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1434) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501) ~[jetty-servlet-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1349) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:177) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322) ~[jetty-rewrite-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:763) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:179) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:400) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:645) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:392) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[jetty-server-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[jetty-io-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[jetty-io-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[jetty-io-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338) ~[jetty-util-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315) ~[jetty-util-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173) ~[jetty-util-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131) ~[jetty-util-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409) ~[jetty-util-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[jetty-util-9.4.44.v20210927.jar:9.4.44.v20210927]

    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[jetty-util-9.4.44.v20210927.jar:9.4.44.v20210927]

    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]

Caused by: java.net.SocketException: Connection reset

    at java.net.SocketInputStream.read(SocketInputStream.java:210) ~[?:1.8.0_141]

    at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_141]

    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_141]

    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_141]

    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_141]

    at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_141]

    at java.io.BufferedReader.fill(BufferedReader.java:161) ~[?:1.8.0_141]

    at java.io.BufferedReader.readLine(BufferedReader.java:324) ~[?:1.8.0_141]

    at java.io.BufferedReader.readLine(BufferedReader.java:389) ~[?:1.8.0_141]

    at java.io.BufferedReader$1.hasNext(BufferedReader.java:571) ~[?:1.8.0_141]

    ... 57 more 

In this setup, when we shut down the ZK5 node, the ZK Status page shows proper information again.

This implies that Solr has connectivity issues only with the ZK5 node.

Other tests performed for ZK connectivity

  1. Ping works from all Solr nodes to all ZK nodes
  2. Telnet on the client port (2181) works from all Solr nodes to all ZK nodes (a check based on a related question)

These indicate that there are no firewall issues and that the ZK server processes are up and running all the time.

Overall, this appears to be a Solr-to-ZK5 connectivity issue. It is unclear at this moment whether the issue is due to ZK5 being in a different DC / region.
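Ping and telnet only confirm that the host is reachable and that the port accepts connections. The stack trace above fails while reading a response in ZookeeperStatusHandler.getZkRawResponse, so a closer check is to send a ZooKeeper four-letter-word command and read the reply. A minimal sketch, assuming Python 3 and that `ruok` is allowed by `4lw.commands.whitelist` on the ZK servers. The hostnames in the usage comment are hypothetical, and this is not Solr's own implementation:

```python
import socket


def four_letter_word(host, port, command="ruok", timeout=5.0):
    """Send a four-letter-word command and return the raw reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        # The server closes the connection after replying, so read until EOF.
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def probe(servers, command="ruok"):
    """Run the command against each (host, port) and collect reply or error."""
    results = {}
    for host, port in servers:
        try:
            results[(host, port)] = four_letter_word(host, port, command)
        except OSError as exc:
            # Covers connection reset, connection refused and timeouts.
            results[(host, port)] = f"ERROR: {type(exc).__name__}: {exc}"
    return results


# Example usage (hypothetical hostnames):
# for server, reply in probe([("zk1.example.com", 2181),
#                             ("zk5.example.com", 2181)]).items():
#     print(server, reply)
```

A healthy server answers `ruok` with `imok`. Running the probe in a loop from each Solr node would also show whether resets are tied to one ZK node or occur intermittently.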


Edit - 27-Sep-2022

More information from further tests: we tried configuring 3 Zookeeper nodes in the same region, with 1 node in one DC and 2 nodes in a second DC. Now things mostly work fine, but we see the connection reset error intermittently.


Any ideas or suggestions on how to find out what is wrong and how we can get this working?


Solution

This question has been open for a while without any comments or responses, so I am answering my own question for the benefit of others who see a similar issue.

After looking around the internet, and especially Apache Solr's issue tracker, I found one issue with a stack trace similar to ours: https://issues.apache.org/jira/browse/SOLR-15849

The description looks very much like the issue we notice in the Solr Admin UI. Per Solr's release page, it is fixed in Solr 8.11.2, which came out in June 2022, a few days after we upgraded to Solr 8.11.1.

We have since upgraded our test cluster per the official notes. After the upgrade the errors have disappeared.