How does Cadence handle faults under various failure conditions?

Cadence is a fault-tolerant stateful code platform. How does it handle faults under different failure conditions?

Solution

There are all kinds of failures in distributed systems, and Cadence provides several ways to handle them.

Here is my list. It may not be complete, but I will add more as I think of them.

Activity failure and retry. See https://cadenceworkflow.io/docs/concepts/activities/#timeouts
Also note that a long-running activity can recover from checkpoints via “heartbeat”.
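
To make the retry and heartbeat idea concrete, here is a minimal sketch in plain Python. It does not use the Cadence client: RetryPolicy, run_with_retry and make_flaky_copy are hypothetical names invented for illustration, and the checkpoint dict only plays the role of the progress an activity reports through heartbeats. It shows an activity that crashes twice, is retried with exponential backoff, and resumes from the last recorded progress instead of starting over.

```python
class RetryPolicy:
    """Hypothetical stand-in for an activity retry policy (not the Cadence API)."""

    def __init__(self, initial_interval=1.0, backoff_coefficient=2.0,
                 maximum_interval=10.0, maximum_attempts=5):
        self.initial_interval = initial_interval
        self.backoff_coefficient = backoff_coefficient
        self.maximum_interval = maximum_interval
        self.maximum_attempts = maximum_attempts

    def delay_before_retry(self, failed_attempt):
        """Seconds to wait after attempt number `failed_attempt` (1-based) fails."""
        delay = self.initial_interval * self.backoff_coefficient ** (failed_attempt - 1)
        return min(delay, self.maximum_interval)


def run_with_retry(activity, policy, sleep=lambda seconds: None):
    """Call `activity(checkpoint)` until it succeeds or attempts run out.

    `checkpoint` survives across attempts, like heartbeat progress that the
    next attempt can read. `sleep` is injectable so the demo does not wait.
    Returns (result, attempts_used, delays_waited).
    """
    checkpoint = {}
    delays = []
    for attempt in range(1, policy.maximum_attempts + 1):
        try:
            return activity(checkpoint), attempt, delays
        except Exception:
            if attempt == policy.maximum_attempts:
                raise  # retries exhausted: surface the failure
            delay = policy.delay_before_retry(attempt)
            delays.append(delay)
            sleep(delay)


def make_flaky_copy(items, fail_at_indexes):
    """Build an activity that processes `items` and crashes once at each given index."""
    processed = []
    pending_failures = set(fail_at_indexes)

    def activity(checkpoint):
        start = checkpoint.get("next_index", 0)  # resume from the last heartbeat
        for i in range(start, len(items)):
            if i in pending_failures:
                pending_failures.discard(i)
                raise RuntimeError(f"simulated crash at item {i}")
            processed.append(items[i])
            checkpoint["next_index"] = i + 1  # "heartbeat" carrying progress
        return list(processed)

    return activity, processed


activity, processed = make_flaky_copy(["a", "b", "c", "d"], fail_at_indexes=[1, 3])
result, attempts, delays = run_with_retry(activity, RetryPolicy())
```

In this run the activity needs three attempts, and because each attempt starts at the checkpoint, no item is processed twice.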

By design of the event sourcing model, a workflow can recover from whatever point it had reached when a worker crashed. See https://cadenceworkflow.io/docs/concepts/workflows/#state-recovery-and-determinism
A workflow can also have a retry policy, like an activity, to retry automatically on failure: https://cadenceworkflow.io/docs/concepts/workflows/#workflow-retries
In some scenarios the failure is caused by a bad code change that leads to wrong state. Cadence provides a “reset” tool to reset a workflow to any point in time. See https://cadenceworkflow.io/docs/cli/#reset-and-restart
On top of reset, Cadence also allows you to reset by deployment. This is useful for resetting a large number of workflows (e.g. millions).
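
The recovery and reset points above can be pictured with a small event-sourcing sketch in plain Python. It is only an illustration under simplifying assumptions: STEPS, replay, run_worker and reset are hypothetical names, the history list stands in for the durable event log kept by the server, and a real reset involves more than truncating a list.

```python
STEPS = ["reserve", "charge", "ship"]  # hypothetical workflow steps


def replay(history):
    """Rebuild the workflow state purely from recorded events (deterministic)."""
    return [e["step"] for e in history if e["type"] == "ActivityCompleted"]


def run_worker(history, crash_after=None):
    """Advance the workflow, appending one event per completed step.

    Returns True if the workflow finished, or False if the worker "crashed"
    after executing `crash_after` steps in this run.
    """
    done = replay(history)  # recover state instead of starting over
    executed = 0
    for step in STEPS[len(done):]:
        if crash_after is not None and executed == crash_after:
            return False  # simulated crash; the recorded history is intact
        history.append({"type": "ActivityCompleted", "step": step})
        executed += 1
    return True


def reset(history, event_count):
    """Sketch of a reset: fork a new history from the first `event_count` events."""
    return history[:event_count]


history = []
first_run_finished = run_worker(history, crash_after=1)  # crashes after "reserve"
state_after_crash = replay(history)
second_run_finished = run_worker(history)                 # another worker resumes

rolled_back = reset(history, 1)   # go back to just after "reserve"
rerun_finished = run_worker(rolled_back)
```

The second worker never repeats "reserve" because replaying the history tells it that step is already done, and the reset re-executes everything after the chosen point while leaving the original history untouched.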

Both activity and workflow workers are stateless.

Cadence server is a highly available and scalable service that provides the durability.

The durability comes from the underlying design and the persistence storage (Cassandra, MySQL, or Postgres).
In a single-cluster setup, the Cadence service runs as a set of independent shards. The cluster consists of multiple hosts, and any failed host can be replaced by another.
Cadence provides cross-data-center replication for much higher availability: https://cadenceworkflow.io/docs/concepts/cross-dc-replication/#global-domains-architecture
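
To illustrate the point about independent shards and replaceable hosts, here is a rough sketch in plain Python. It rests on assumptions that are mine, not taken from the Cadence docs: workflows are mapped to a fixed number of shards by hashing the workflow ID, and shards are mapped to live hosts with rendezvous hashing. NUM_SHARDS, the host names and all function names are made up, and the mechanism Cadence actually uses may differ.

```python
import hashlib

NUM_SHARDS = 8  # assumed fixed shard count for this sketch


def stable_hash(text):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")


def shard_for(workflow_id):
    """Map a workflow to a shard; the mapping never depends on which hosts are up."""
    return stable_hash(workflow_id) % NUM_SHARDS


def owner_of(shard_id, hosts):
    """Rendezvous hashing: the live host with the highest score owns the shard."""
    return max(hosts, key=lambda host: stable_hash(f"{host}/{shard_id}"))


def assignment(hosts):
    """Shard -> owning host for the current set of live hosts."""
    return {shard: owner_of(shard, hosts) for shard in range(NUM_SHARDS)}


before = assignment(["host-a", "host-b", "host-c"])
after = assignment(["host-a", "host-c"])  # host-b has failed
```

When host-b disappears, only the shards it owned move to the surviving hosts. Since the shard state lives in the persistence store, a new owner can load it and carry on.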