apache solr apache-zookeeper solrcloud

ZooKeeper showing non-existent node after network outage


I have a three-box SolrCloud setup with ZooKeeper; each server runs both Solr and ZooKeeper (not ideal, I know). Everything was working fine until a network outage this morning.

After the outage, boxes A and C came back as expected. Box B did not; restarting the Solr service there produced the error "A previous ephemeral live node still exists. Solr cannot continue."

Looking at the /live_nodes path in ZooKeeper on box B, the Solr instance is already listed as a live node even though Solr is stopped. The node is not listed under /live_nodes on boxes A and C. I also cannot delete or rmr the node, because ZooKeeper reports that it does not exist.

I tried solr stop -all in case a hidden process was still running, but Solr reports that no instances are running.

My next step was to install a fresh ZooKeeper instance on B. Once it was up, ls /live_nodes still showed the Solr instance that doesn't exist.
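Since box B lists a live node that A and C do not, one check that may help is confirming whether all three ZooKeeper servers are actually in the same ensemble. The sketch below sends ZooKeeper's srvr four-letter-word command over the client port and reads the "Mode:" line of the reply (leader, follower, or standalone). It is a minimal sketch and not a full diagnosis; the function names are made up for illustration. It assumes the default client port 2181 and that srvr is permitted by the server's 4lw.commands.whitelist setting (relevant on ZooKeeper 3.5 and later).

```python
import socket


def four_letter_word(host, port, command, timeout=5.0):
    """Send a ZooKeeper four-letter-word command and return the raw reply text."""
    if len(command) != 4:
        raise ValueError("four-letter-word commands are exactly 4 characters")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(srvr_reply):
    """Return the value of the 'Mode:' line in an srvr reply, or None if absent."""
    for line in srvr_reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None
```

Calling parse_mode(four_letter_word(host, 2181, "srvr")) against each box (the hostnames are whatever A, B and C resolve to in your environment) should normally show one leader and two followers. A server reporting standalone, or failing to answer, would suggest it is not serving the same data as the other two.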

Any help is appreciated. Thank you.


Solution

  • FYI, I continued troubleshooting and eventually rebuilt all three ZooKeeper nodes. That led to a separate error reporting that the collection's shard was broken. After troubleshooting the 'clusterstate.json' file, the fix turned out to be creating a duplicate collection under a different name, then creating an alias to redirect traffic to it. After that I was able to delete the broken collection.

    I suspect the duplicate collection and alias alone would have fixed it from the start.

    Hopefully this helps someone in the future. Thanks.
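
The sequence described above (create a duplicate collection, point an alias at it, delete the broken collection) can be sketched as three Solr Collections API requests. This is a minimal sketch that only builds the request URLs. The base URL, collection names, alias name, configset name, shard count and replication factor are all placeholders, not values from the original setup. It also does not copy any documents into the new collection.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Solr Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, **params})
    return f"{base_url.rstrip('/')}/admin/collections?{query}"


def recovery_urls(base_url, broken, replacement, alias, config_name,
                  num_shards=1, replication_factor=1):
    """Return the three requests in order: CREATE, CREATEALIAS, DELETE."""
    return [
        # 1. Create the duplicate collection under a different name.
        collections_api_url(
            base_url, "CREATE",
            name=replacement,
            numShards=num_shards,
            replicationFactor=replication_factor,
            **{"collection.configName": config_name},
        ),
        # 2. Point an alias at the new collection so traffic is redirected.
        collections_api_url(base_url, "CREATEALIAS",
                            name=alias, collections=replacement),
        # 3. Delete the broken collection once the alias is in place.
        collections_api_url(base_url, "DELETE", name=broken),
    ]


# Placeholder names, for illustration only.
for url in recovery_urls("http://localhost:8983/solr", "products",
                         "products_v2", "products_alias", "products_conf"):
    print(url)
```

Each URL can then be requested with curl or a browser against any live Solr node. Building the URLs separately from sending them makes it easy to review the three steps before anything is deleted.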