database · distributed-database · yugabytedb

How does YugaByte DB's replication model work?


How similar or different is YugaByte DB's replication model compared to PostgreSQL's master-slave replication?


Solution

  • PostgreSQL is a single-node RDBMS. Tables are not horizontally partitioned/sharded into smaller components, since all data is served from a single node anyway. In a highly available (HA) deployment, a master-slave node pair is used. The master is responsible for handling writes/updates to the data. Committed data gets asynchronously copied over to a completely independent "slave" instance. Upon master failure, the application clients can start talking to the slave instance, with the caveat that they won't see data that was recently committed at the master but not yet copied over to the slave. The master-to-slave failover, master repair, and slave-to-master failback are usually handled manually.
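
    To make that failure mode concrete, here is a minimal Python sketch of asynchronous master-slave replication. It is purely conceptual; the class and method names (AsyncReplicatedPair, commit, ship, failover) are made up for illustration and are not PostgreSQL APIs.

    ```python
    # Conceptual sketch of asynchronous master-slave replication (not PostgreSQL
    # code): a commit is acknowledged before it reaches the slave, so a failover
    # can lose the most recent commits.

    class AsyncReplicatedPair:
        def __init__(self):
            self.master = []           # rows committed on the master
            self.slave = []            # rows already copied to the slave
            self.replication_lag = []  # committed but not yet shipped

        def commit(self, row):
            # The master acknowledges the commit immediately ...
            self.master.append(row)
            # ... and only queues the row for asynchronous shipping.
            self.replication_lag.append(row)

        def ship(self):
            # Background replication catches the slave up when it runs.
            self.slave.extend(self.replication_lag)
            self.replication_lag.clear()

        def failover(self):
            # Promote the slave: anything still in replication_lag is lost.
            lost = list(self.replication_lag)
            self.master, self.replication_lag = list(self.slave), []
            return lost

    pair = AsyncReplicatedPair()
    pair.commit("row-1")
    pair.ship()
    pair.commit("row-2")    # acknowledged to the client ...
    print(pair.failover())  # ... but lost on failover: ['row-2']
    ```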

    On the other hand, YugaByte DB is a Google Spanner-inspired distributed RDBMS where HA deployments start with a minimum of 3 nodes. Tables are horizontally partitioned/sharded into smaller components called "tablets", and tablets are distributed evenly across all the available nodes. Each tablet is made resilient to failures by automatically storing 2 additional copies on 2 additional nodes, for a total of 3 copies. These 3 copies are known as a tablet group. Instead of managing replication at the level of the overall instance with all of its data (as PostgreSQL does), YugaByte DB manages replication at the level of each individual tablet group using a distributed consensus protocol called Raft.
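
    The following Python sketch illustrates the sharding idea: rows are hashed into tablets, and each tablet group gets 3 copies spread across the nodes. The tablet count, hash function, and placement policy are simplified assumptions for illustration, not YugaByte DB's actual implementation.

    ```python
    # Illustrative sketch only: hash-shard rows into tablets and place 3 copies
    # of each tablet group on distinct nodes.
    import hashlib

    NODES = ["node-1", "node-2", "node-3"]
    NUM_TABLETS = 6
    REPLICATION_FACTOR = 3

    def tablet_for_row(primary_key: str) -> int:
        # Hash the primary key into one of NUM_TABLETS shards.
        digest = hashlib.sha256(primary_key.encode()).hexdigest()
        return int(digest, 16) % NUM_TABLETS

    def replicas_for_tablet(tablet_id: int) -> list:
        # Spread the copies of each tablet group across distinct nodes.
        start = tablet_id % len(NODES)
        return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

    for key in ["user-42", "user-43", "order-7"]:
        t = tablet_for_row(key)
        print(key, "-> tablet", t, "replicated on", replicas_for_tablet(t))
    ```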

    Raft (along with a mechanism called Leader Leases) ensures that only 1 of the 3 copies can be the leader (responsible for serving writes/updates and strongly consistent reads) at any time. Writes for a given row are handled by the corresponding tablet leader, which commits the data locally as well as on at least 1 follower before acknowledging success to the client application. Loss of a node leads to loss of the tablet leaders hosted on that node. New updates on those tablets cannot be processed until new tablet leaders are auto-elected from among the remaining followers. This auto-election usually takes a few seconds and depends primarily on the network latency across the nodes. After the election completes, the cluster is ready to accept writes even for the tablets that were impacted by the node failure.
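
    Here is a simplified Python sketch of that quorum rule: the leader acknowledges a write only once it is durable on a majority of the 3 copies, i.e. the leader itself plus at least 1 follower. It is not an actual Raft implementation; all names and structures are illustrative assumptions.

    ```python
    # Simplified majority-quorum write path (not a real Raft implementation).

    class Follower:
        def __init__(self, up=True):
            self.up, self.log = up, []

        def append(self, row):
            # A healthy follower persists the row and acknowledges it.
            if self.up:
                self.log.append(row)
            return self.up

    REPLICATION_FACTOR = 3
    MAJORITY = REPLICATION_FACTOR // 2 + 1  # 2 out of 3 copies

    def leader_write(row, followers):
        """Acknowledge the client only once a majority of copies have the row."""
        acks = 1                            # the leader's own local copy
        for follower in followers:
            if follower.append(row):
                acks += 1
            if acks >= MAJORITY:
                return True                 # committed: safe to ack the client
        return False                        # no majority reached

    # One follower down: leader + 1 follower still form a majority.
    print(leader_write("row-1", [Follower(up=True), Follower(up=False)]))   # True
    # Both followers down: no majority, so the write cannot be acknowledged.
    print(leader_write("row-2", [Follower(up=False), Follower(up=False)]))  # False
    ```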

    The Google Spanner design followed by YugaByte DB does require committing the data at one more copy than PostgreSQL, which means higher write latency than PostgreSQL. In return, the benefits are built-in repair (through leader auto-election) and failover (to the newly elected leader). Even failback (after the failed node is back online) is automatic. This design is a better bet when the infrastructure running the DB cluster is expected to fail relatively often.
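
    A rough way to see the latency trade-off is sketched below; the numbers are purely illustrative assumptions, not measurements of either system.

    ```python
    # Rough latency model of the trade-off above (illustrative assumptions only).

    def async_single_node_write(local_commit_ms):
        # PostgreSQL-style async replication: ack after the local commit alone.
        return local_commit_ms

    def quorum_write(local_commit_ms, fastest_follower_round_trip_ms):
        # Quorum replication: ack after the local commit plus the round trip
        # needed to commit on at least one follower.
        return local_commit_ms + fastest_follower_round_trip_ms

    # Example with assumed values in milliseconds:
    print(async_single_node_write(1.0))  # 1.0
    print(quorum_write(1.0, 0.5))        # 1.5
    ```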

    For more details, refer to my post on this topic.