
Recovering from single shard loss with a replica in SolrCloud


I have a SolrCloud cluster with a collection that has RF=2 and NumShards=3 on 6 nodes. We want to test how to recover from unexpected situations such as shard loss, so we will probably run rm -rf on the Solr data directory of one of the replicas or the master. How will the wiped node recover from the shard loss? Are manual steps required (and if so, what needs to be done), or will it recover automatically from the other replica?


Solution

  • You haven't specified a Solr version, but here's a synopsis of some of the relevant concepts:

    1. SolrCloud records cluster state in two places: on the local disk of each node, and in ZooKeeper. When Solr starts on a node, it scans its local disk for Solr "cores" (replicas, in this case), and if it finds any, it registers itself in ZK as serving that replica. If, according to ZK, it is not the Leader of that replica's shard, it syncs itself from the Leader before it starts serving traffic.

    2. The Leader of a shard is an ephemeral role. (I avoid Master/Slave terminology here because it generally describes a non-SolrCloud setup.) If the Leader goes down, a non-leader replica is elected the new Leader and life goes on. If the former Leader comes back, it is now a non-leader. Generally you don't need to concern yourself with which replica is the Leader.

    3. SolrCloud does not generally assign replicas automatically. You explicitly tell it where you want things.
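The leader and replica bookkeeping described in the points above can be inspected through the Collections API `CLUSTERSTATUS` action. The following is a minimal sketch that summarizes such a response, assuming it has the usual `cluster` → `collections` → `shards` → `replicas` layout, with a `state` field and a `leader` field set to the string `"true"` on the leader. The collection name `mycoll`, the replica names, and the node names in `sample_status` are hypothetical; the exact response shape may differ between Solr versions.

```python
def summarize_shards(cluster_status, collection):
    """Summarize leader and replica health per shard.

    `cluster_status` is assumed to be the parsed JSON body of a
    CLUSTERSTATUS call. Returns a dict keyed by shard name.
    """
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    summary = {}
    for shard_name, shard in shards.items():
        leader = None
        active, inactive = [], []
        for replica_name, replica in shard["replicas"].items():
            # The leader flag is assumed to be the string "true", not a boolean.
            if replica.get("leader") == "true":
                leader = replica_name
            if replica.get("state") == "active":
                active.append(replica_name)
            else:
                inactive.append(replica_name)
        summary[shard_name] = {
            "leader": leader,
            "active": sorted(active),
            "inactive": sorted(inactive),
        }
    return summary


# Hypothetical response fragment: shard1 is healthy, shard2 has lost a replica.
sample_status = {
    "cluster": {
        "collections": {
            "mycoll": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node1": {"state": "active", "leader": "true",
                                           "node_name": "solr1:8983_solr"},
                            "core_node2": {"state": "active",
                                           "node_name": "solr2:8983_solr"},
                        }
                    },
                    "shard2": {
                        "replicas": {
                            "core_node3": {"state": "active", "leader": "true",
                                           "node_name": "solr3:8983_solr"},
                            "core_node4": {"state": "down",
                                           "node_name": "solr4:8983_solr"},
                        }
                    },
                }
            }
        }
    }
}

shard_summary = summarize_shards(sample_status, "mycoll")
```

A shard whose `inactive` list is non-empty, or whose `active` list is shorter than the replication factor, is one that needs attention.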

    Given these things, your intended "failure mode" is a bit interesting. Deleting the files out from under a running JVM probably won't do much at first. The JVM holds open file handles to all the index files, so the OS can't reclaim them even though you've deleted the directory entries. Things will probably continue normally until the next time Solr needs to write a new segment file to a directory that no longer exists, at which point things will explode. I don't know exactly how it will fail.

    If you stop Solr, delete the directory, and restart Solr, though, you've deleted the knowledge that this Solr node participates in any index. Solr will come up and join the cluster, but it will not host any replicas of any shard. You'll probably need to use ADDREPLICA to put the replica back.
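
As a rough illustration of the ADDREPLICA step, here is a minimal sketch that builds the Collections API request URL. The base URL, the collection name `mycoll`, the shard name `shard2`, and the node name `solr4:8983_solr` are hypothetical placeholders; check the parameters against the documentation for your Solr version before relying on them.

```python
from urllib.parse import urlencode


def addreplica_url(base_url, collection, shard, node=None):
    """Build a Collections API URL that asks Solr to add a replica.

    `base_url` is the Solr root, e.g. http://localhost:8983/solr (assumed).
    `node` is optional; when omitted, Solr chooses where to place the replica.
    """
    params = {"action": "ADDREPLICA", "collection": collection, "shard": shard}
    if node is not None:
        # Node names are assumed to follow the host:port_context form.
        params["node"] = node
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


# Hypothetical values: re-create the lost replica of shard2 on the wiped node.
url = addreplica_url("http://localhost:8983/solr", "mycoll", "shard2",
                     node="solr4:8983_solr")
```

Sending a GET request to that URL (with curl or `urllib.request.urlopen`, for example) should create an empty replica on the target node, which then syncs its index from the shard Leader as described in point 1.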