I have a three-box SolrCloud setup with ZooKeeper. Each server has both a Solr and a ZK install (not ideal, I know). Everything was working fine until a network outage this morning.
After the outage, boxes A and C came back as expected. Box B did not. Restarting the Solr service on B revealed an error which states:
A previous ephemeral live node still exists. Solr cannot continue.
Looking at the /live_nodes path in the ZooKeeper instance on box B, the Solr install is already listed as an active live node even though Solr is off. This node is not shown on boxes A and C under the /live_nodes path. I'm also unable to delete or rmr this node, because ZooKeeper tells me that it doesn't exist.
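To make the disagreement between the ZooKeeper servers explicit, the children of /live_nodes reported by each server can be compared side by side. The following is only a minimal sketch: the server labels and node names are hypothetical, and the listings are assumed to be copied by hand from running ls /live_nodes against each server.

```python
def diff_live_nodes(listings):
    """Return, for each live node, the ZooKeeper servers that do NOT list it.

    `listings` maps a server label to the children of /live_nodes as
    reported by that server. Nodes listed by every server are omitted.
    """
    all_nodes = set()
    for children in listings.values():
        all_nodes.update(children)

    missing = {}
    for node in sorted(all_nodes):
        absent = sorted(
            server for server, children in listings.items()
            if node not in children
        )
        if absent:
            missing[node] = absent
    return missing


# Hypothetical listings mirroring the situation described above:
# only the ZooKeeper on box B still shows box B's Solr node.
listings = {
    "zk-A": ["boxA:8983_solr", "boxC:8983_solr"],
    "zk-B": ["boxA:8983_solr", "boxB:8983_solr", "boxC:8983_solr"],
    "zk-C": ["boxA:8983_solr", "boxC:8983_solr"],
}

print(diff_live_nodes(listings))
```

A node reported by only one server, as here, may suggest that this server's view is out of sync with the rest of the ensemble, though the sketch itself cannot confirm the cause.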
I have attempted solr stop -all in case there was a hidden process that I wasn't seeing, but Solr states that there are no instances running.
My next move was installing a fresh ZooKeeper instance on B. Once that was up, ls /live_nodes still shows this Solr instance, which doesn't exist.
Any help is appreciated. Thank you.
FYI, I continued troubleshooting and eventually rebuilt all three ZooKeeper nodes. That led to a separate error indicating that the collection shard was broken. After troubleshooting the clusterstate.json file, the fix turned out to be creating a duplicate collection under a different name, plus an alias to redirect traffic to it. After this I was able to delete the broken collection.
I'm thinking a duplicate collection and alias would have fixed it the whole time.
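For anyone wanting to try the same route, here is a rough sketch of the three Solr Collections API requests involved (CREATE, CREATEALIAS, DELETE), in the order described above. It only builds the URLs and does not send anything. The base URL, the collection, alias, and configset names, and the shard and replica counts are all hypothetical placeholders, not the values from my cluster.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, params):
    """Build a Solr Collections API URL for the given action."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/collections?{query}"


def recovery_urls(base_url, broken, replacement, alias,
                  num_shards, replication_factor, config_name):
    """Return the request URLs in the order they would be issued."""
    return [
        # 1. Create the duplicate collection under a different name.
        collections_api_url(base_url, "CREATE", {
            "name": replacement,
            "numShards": num_shards,
            "replicationFactor": replication_factor,
            "collection.configName": config_name,
        }),
        # 2. Point an alias at the new collection to redirect traffic.
        collections_api_url(base_url, "CREATEALIAS", {
            "name": alias,
            "collections": replacement,
        }),
        # 3. Delete the broken collection.
        collections_api_url(base_url, "DELETE", {"name": broken}),
    ]


# All values below are hypothetical placeholders.
urls = recovery_urls(
    base_url="http://localhost:8983/solr",
    broken="products",
    replacement="products_v2",
    alias="products_live",
    num_shards=1,
    replication_factor=3,
    config_name="products_conf",
)
for url in urls:
    print(url)
```

The sketch creates an empty collection only. Getting documents into it, and whether clients can query through the alias without changes, depends on the setup.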
Hopefully this helps someone in the future. Thanks.