Cadence is a fault-tolerant stateful code platform. How does Cadence handle faults under various failure conditions?
There are all kinds of failures in distributed systems, and Cadence provides several options for dealing with them.
Here is my own list. It may not be complete, but I will add more as I think of them.
Because Cadence is built on an event sourcing model, a workflow can recover from whatever point it had reached when a worker crashed. See https://cadenceworkflow.io/docs/concepts/workflows/#state-recovery-and-determinism
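To make the recovery idea concrete, here is a minimal sketch of rebuilding state by replaying a persisted history. It does not use the Cadence client; the `Event` and `State` types and the `Replay` function are hypothetical simplifications of my own, and real Cadence history events carry far more detail.

```go
package main

import "fmt"

// Event is a hypothetical, simplified history entry (not a Cadence type).
type Event struct {
	ID     int
	Type   string // e.g. "WorkflowStarted", "ActivityCompleted"
	Result int    // result recorded for a completed activity
}

// State is the in-memory workflow state that is lost when a worker crashes.
type State struct {
	CompletedSteps int
	Total          int
}

// Replay rebuilds the state from the persisted history. Because it only
// depends on the recorded events, it gives the same answer on any worker.
func Replay(history []Event) State {
	var s State
	for _, e := range history {
		if e.Type == "ActivityCompleted" {
			s.CompletedSteps++
			s.Total += e.Result
		}
	}
	return s
}

func main() {
	// History as it was persisted before the worker crashed.
	history := []Event{
		{ID: 1, Type: "WorkflowStarted"},
		{ID: 2, Type: "ActivityCompleted", Result: 10},
		{ID: 3, Type: "ActivityCompleted", Result: 5},
	}
	// A different worker picks the workflow up and replays the history.
	recovered := Replay(history)
	fmt.Printf("resume at step %d with total %d\n",
		recovered.CompletedSteps+1, recovered.Total)
}
```

This is also why workflow code has to be deterministic: replaying the same history must lead to the same state.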
Like an activity, a workflow can also have a retry policy so that it is retried automatically on failure. See https://cadenceworkflow.io/docs/concepts/workflows/#workflow-retries
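As a rough illustration of what a retry policy with exponential backoff does, here is a small self-contained sketch. The `RetryPolicy` struct below is my own stand-in, not the Cadence client type; its field names are modeled on the options described in the Cadence docs, and `NextDelay` is a hypothetical helper. How the server actually schedules retries may differ in detail.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// RetryPolicy is a hypothetical stand-in for a retry policy (not a Cadence type).
type RetryPolicy struct {
	InitialInterval    time.Duration // delay before the first retry
	BackoffCoefficient float64       // multiplier applied after each failure
	MaximumInterval    time.Duration // cap on the delay; 0 means no cap
	MaximumAttempts    int           // total attempts allowed; 0 means unlimited
}

// NextDelay returns how long to wait after the given failed attempt
// (1-based), and false when no further attempt is allowed.
func (p RetryPolicy) NextDelay(attempt int) (time.Duration, bool) {
	if p.MaximumAttempts > 0 && attempt >= p.MaximumAttempts {
		return 0, false
	}
	d := float64(p.InitialInterval) * math.Pow(p.BackoffCoefficient, float64(attempt-1))
	if p.MaximumInterval > 0 && d > float64(p.MaximumInterval) {
		d = float64(p.MaximumInterval)
	}
	return time.Duration(d), true
}

func main() {
	p := RetryPolicy{
		InitialInterval:    time.Second,
		BackoffCoefficient: 2,
		MaximumInterval:    3 * time.Second,
		MaximumAttempts:    4,
	}
	// Pretend every attempt fails and show the resulting schedule.
	for attempt := 1; ; attempt++ {
		d, ok := p.NextDelay(attempt)
		if !ok {
			fmt.Printf("attempt %d failed, giving up\n", attempt)
			break
		}
		fmt.Printf("attempt %d failed, retrying in %v\n", attempt, d)
	}
}
```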
In some scenarios the failure is caused by a bad code change that leaves workflows in a wrong state. Cadence provides a “reset” tool to reset a workflow to any earlier point in its history. See https://cadenceworkflow.io/docs/cli/#reset-and-restart
On top of that, Cadence also allows you to reset by deployment. This is useful for resetting a large number of workflows (e.g. millions).
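Conceptually, a reset keeps the history up to a chosen event and lets the workflow continue from there, and a reset by deployment picks that event based on which worker build produced the bad progress. The sketch below is only a model of that idea under my own assumptions: the `Event` type, its `Binary` field, and the `ResetPoint` and `ResetTo` helpers are hypothetical and are not Cadence APIs.

```go
package main

import "fmt"

// Event is a hypothetical history entry tagged with the worker build
// that produced it (not a Cadence type).
type Event struct {
	ID     int
	Binary string // assumed identifier of the worker build, e.g. a checksum
}

// ResetPoint returns the ID of the last event recorded before badBinary
// first appears. If badBinary never appears, it returns the last event ID.
func ResetPoint(history []Event, badBinary string) int {
	last := 0
	for _, e := range history {
		if e.Binary == badBinary {
			return last
		}
		last = e.ID
	}
	return last
}

// ResetTo returns a copy of the history truncated at eventID. The workflow
// would then continue from that point with the fixed code.
func ResetTo(history []Event, eventID int) []Event {
	var kept []Event
	for _, e := range history {
		if e.ID > eventID {
			break
		}
		kept = append(kept, e)
	}
	return kept
}

func main() {
	history := []Event{
		{ID: 1, Binary: "build-v1"},
		{ID: 2, Binary: "build-v1"},
		{ID: 3, Binary: "build-v2"}, // progress made by the bad deployment
		{ID: 4, Binary: "build-v2"},
	}
	point := ResetPoint(history, "build-v2")
	fmt.Printf("reset to event %d, keeping %d events\n",
		point, len(ResetTo(history, point)))
}
```

Applying the same rule to every affected workflow is what makes a bulk reset of many workflows practical.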
Both activity and workflow workers are stateless, so when one crashes another worker can take over its tasks.
The Cadence server is a highly available and scalable service that provides the durability.