Tags: java, spring-boot, consul, disaster-recovery

How to configure consul-replicate in 2 datacenters for disaster recovery?


I am trying to configure disaster recovery for my applications. Our Consul is stateful, meaning there are write operations on the Consul KV store that should be consistent across datacenters. In other words, if I perform a write in dc1 and a read happens in dc2, the read must return the latest value of that key. Here is my thought process: I am going to join the two datacenters via WAN join (note that each datacenter has 4 servers), and any write operation in dc1 will be replicated to dc2 via the consul-replicate tool. I tried ACL replication, but it seems complicated. I also searched online for consul-replicate config examples but could not find anything helpful. Can someone guide me on this? Thanks in advance.


Solution

  • You can use consul-replicate to keep KV data in sync between Consul data centers, but you'll want to keep a couple of things in mind about the setup.

    consul-replicate uses blocking queries to watch for changes under the configured KV prefix. The way blocking queries work today is that if a single key is updated under your watched prefix, the blocking query will return the data for all keys under that prefix – even though only one key was updated – causing consul-replicate to PUT/update all watched keys in the remote DC. Depending on how many keys you are replicating, and how often those keys are updated, you may see higher bandwidth utilization and encounter performance issues.
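    To see this behavior directly, you can issue the same recursive blocking query against the KV HTTP API that consul-replicate issues internally. This is a sketch assuming a local agent on 127.0.0.1:8500; the prefix and key names are made up for illustration, and `<X-Consul-Index>` is a placeholder you would fill in from the response header of the first call:

    ```shell
    # Recursive read of the prefix; the X-Consul-Index response header is the
    # index used to make the next call a blocking query.
    curl -sv "http://127.0.0.1:8500/v1/kv/env/prod?recurse" 2>&1 | grep X-Consul-Index

    # Block until anything under env/prod changes. Even if only a single key
    # (say env/prod/db/password) was updated, the response body contains ALL
    # keys under env/prod, and consul-replicate re-PUTs each of them in the
    # remote datacenter.
    curl -s "http://127.0.0.1:8500/v1/kv/env/prod?recurse&index=<X-Consul-Index>"
    ```
    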

    In order to mitigate the performance concerns, you can run multiple consul-replicate processes, each one responsible for replicating a specific, scoped prefix from the KV tree (e.g., /env/prod and /env/dev) instead of the entire tree. https://github.com/hashicorp/consul/issues/2791 is a feature request to improve the behavior of watch so it only returns data for changed keys instead of all keys being watched.
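    As a sketch of that split (the prefix names are assumptions carried over from the example above), the consul-replicate README describes a `-prefix` flag that takes the source prefix with an optional `@datacenter` qualifier, so each scoped replicator can run as its own process:

    ```shell
    # One daemon replicating only env/prod from dc1...
    consul-replicate -prefix "env/prod@dc1" &

    # ...and a separate daemon for env/dev, so a churn-heavy prefix does not
    # force re-replication of keys under the other prefix.
    consul-replicate -prefix "env/dev@dc1" &
    ```
    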

    Also, consul-replicate does not provide the same data guarantees that Raft provides within a single Consul cluster. Since the replication is external to Consul, when a user or service writes a key in DC1, there is no way for Consul to return an error if that key has not yet been successfully replicated to DC2 – Consul is completely unaware the data is being replicated at all. Replication performed by consul-replicate is asynchronous and essentially best-effort.
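    You can observe the replication lag with the KV HTTP API's `dc` query parameter, which routes a read to another datacenter over the WAN link. This is a hedged sketch: the key name is hypothetical, and what the second call returns depends entirely on whether consul-replicate has copied the key yet:

    ```shell
    # Write a key in the local datacenter (dc1):
    curl -s -X PUT -d 'v2' "http://127.0.0.1:8500/v1/kv/env/prod/app/flag"

    # Read the same key routed to dc2. Until consul-replicate has PUT the key
    # there, this may return a stale value or a 404 – Consul itself makes no
    # cross-DC consistency promise for the KV store.
    curl -s "http://127.0.0.1:8500/v1/kv/env/prod/app/flag?dc=dc2&raw"
    ```
    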

    As far as setting this all up, the README for consul-replicate is fairly detailed and explains each of the available configuration options. A minimal config for a consul-replicate daemon running in DC2 that replicates from DC1 would probably look like this:

    # This denotes the start of the configuration section for Consul. All values
    # contained in this section pertain to Consul.
    consul {
      address = "127.0.0.1:8500"
      token = "<token>"
    }
    
    # This is the path to store a PID file which will contain the process ID of the
    # Consul Replicate process. This is useful if you plan to send custom signals
    # to the process.
    pid_file = "/var/run/consul-replicate/pid"
    
    # This is the prefix and datacenter to replicate and the resulting destination.
    prefix {
      source      = "env/prod"
      datacenter  = "dc1"
      destination = "env/prod"
    }
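    Saved to a file such as /etc/consul-replicate.hcl (the path is an assumption), the daemon in DC2 would then be started by pointing consul-replicate at that file with its `-config` flag:

    ```shell
    # Run the replicator in DC2 against the config above; -config may also
    # point at a directory of .hcl files.
    consul-replicate -config /etc/consul-replicate.hcl
    ```
    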