Why does Aeron Cluster not read snapshots when starting from scratch?


I'm testing some failure and recovery scenarios, and I've come across a behavior that I do not completely understand.

We have a three-node cluster. We perform some operations, take a snapshot, and then one of the nodes dies. I'm simulating the situation where that node loses its storage.

When I restart the node, it appears to replay the operations from the log rather than starting from the snapshot, as it would if its storage had not been lost.

I think this is problematic because snapshots not only speed up startup but, more importantly, provide stability when the software changes. As the software evolves, the cluster's behavior in response to the same input messages might change. If we replay the old data with the new code, we might end up in a different state. The snapshot helps with this by recording the state the cluster reached with the old version and letting the cluster start from that state onward with the new version. But in the scenario above, Aeron appears to skip the snapshot, so the failed node will end up in a different state than the surviving nodes.

Is there something wrong with this reasoning? Is there something I can do to recover from such a failure (apart from taking backups)?


Solution

  • If a node loses its storage, it needs the leader to replay the log to it from the beginning of time. Otherwise, you need a means of getting a backup of the storage from one of the other nodes or from some archive. ClusterBackup provides a basic mechanism to do this. Aeron also has a premium offering that includes standby nodes, which provide a more elegant recovery mechanism for replacing a node whose storage has been lost.
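
To make the difference between the two recovery paths concrete, here is a minimal toy model in Python. It does not use the Aeron API: `apply_v1`, `apply_v2`, `recover`, and the integer "state" are hypothetical names made up for illustration. It only sketches why a node that replays the whole log with changed logic can diverge from nodes that started from a snapshot, and why restoring a copy of the snapshot avoids that.

```python
# Toy model only: none of these names come from the Aeron API.

def apply_v1(state, msg):
    """Old service logic: add the message value to the running total."""
    return state + msg


def apply_v2(state, msg):
    """New service logic: the same input now has a different effect."""
    return state + 2 * msg


def recover(log, apply, snapshot=None):
    """Rebuild state from an optional (position, state) snapshot plus the log.

    Without a snapshot, every entry is replayed from position 0.
    With a snapshot, only entries at or after its position are replayed.
    """
    position, state = snapshot if snapshot is not None else (0, 0)
    for msg in log[position:]:
        state = apply(state, msg)
    return state


log = [1, 2, 3, 4, 5]

# The cluster ran the old logic for the first three entries, then snapshotted.
snapshot = (3, recover(log[:3], apply_v1))  # (3, 6)

# Surviving nodes restart on the new version from their local snapshot.
survivor_state = recover(log, apply_v2, snapshot)  # 6 + 8 + 10 = 24

# The node that lost its storage has no snapshot and replays everything.
replayed_state = recover(log, apply_v2)  # 2 * (1+2+3+4+5) = 30

# The same node, after a copy of the snapshot is restored from a backup.
restored_state = recover(log, apply_v2, snapshot)  # 24, matches the survivors

print(survivor_state, replayed_state, restored_state)
```

In this sketch, the full replay only matches the survivors if the service logic is unchanged (or deterministic across versions for old input); otherwise the node needs the snapshot copied back from a peer or a backup before it rejoins.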