database solr distributed-computing solrcloud

How does solr handle high availability?

I can't understand how does solr handle high availability in solrCloud. In its reference guide it pointed out that it uses CDCR to handle HA. But I think that this is an expensive strategy.

Can anyone tell what it actually handle HA and why is it an optimum way? Thanks a lot.

Solution

There are a few levels of HA - you need to ask yourself, what kinds of failures can I tolerate? Things like:

Node failure
Multiple Node failures
Rack failure
Data Center failure
Region failure

SolrCloud's basic cluster setup provides you the tools to cover #1-3 pretty easily. Add replicas, distribute them correctly among racks.

You can get #4, or even #5, using a single SolrCloud cluster spread around multiple data centers (Multi-AZ in AWS for #4, or Multi-Region in AWS for #5), but a single SolrCloud cluster doesn't have any locality awareness, so you need to understand that intra-cluster communication will often be cross-data-center, so the data centers really need to be low-latency between each other or your query latency will suffer badly.

SolrCloud's CDCR is a way to connect two or more independent SolrCloud clusters, and essentially create master/slave relationships between clusters. This gives you #4 or #5 without the penalty of cross-cluster traffic latency.