Search code examples
databasesolrdistributed-computingsolrcloud

How does solr handle high availability?


I can't understand how does solr handle high availability in solrCloud. In its reference guide it pointed out that it uses CDCR to handle HA. But I think that this is an expensive strategy.

Can anyone tell what it actually handle HA and why is it an optimum way? Thanks a lot.


Solution

  • There are a few levels of HA - you need to ask yourself, what kinds of failures can I tolerate? Things like:

    1. Node failure
    2. Multiple Node failures
    3. Rack failure
    4. Data Center failure
    5. Region failure

    SolrCloud's basic cluster setup provides you the tools to cover #1-3 pretty easily. Add replicas, distribute them correctly among racks.

    You can get #4, or even #5, using a single SolrCloud cluster spread around multiple data centers (Multi-AZ in AWS for #4, or Multi-Region in AWS for #5), but a single SolrCloud cluster doesn't have any locality awareness, so you need to understand that intra-cluster communication will often be cross-data-center, so the data centers really need to be low-latency between each other or your query latency will suffer badly.

    SolrCloud's CDCR is a way to connect two or more independent SolrCloud clusters, and essentially create master/slave relationships between clusters. This gives you #4 or #5 without the penalty of cross-cluster traffic latency.