apache-flink flink-streaming flink-cep flink-statefun

Apache Flink - FsStateBackend - How state is recovered in case of Task Manager failure which stores state in its local file system

Assume we have 2 Job Managers (ZooKeeper for HA) and 3 Task Managers. I have configured FsStateBackend for checkpointing. I assume that the FsStateBackend runs in each of the Task Managers which maintains the state in the memory. Upon checkpoint, the state is persisted in the path which we have configured(file:/data). Basically I have configured the path pointing to the local file system. So each of the Task Managers has its own local disk storage where checkpointed data is persisted. As per my understanding a small meta data is sent to Job Manager on checkpointing.

What happens If one of the Task Manager crashes? It is for sure that the tasks are started in any of the available Task Managers. How is the job state recovered since the Task Manager's (Task Manager which crashed) checkpointed data is not available since it is down? Does the checkpoint process send the state information to Job Manager?
What is the metadata sent by Task Manager to Job Manager during the checkpointing?
Does the file system which we are using is supposed to be distributed state? E.g. NFS, S3. What happens If we use the systems local storage for checkpointing.

Thank you

Solution

You should always use a distributed file system for checkpointing. Something like HDFS, S3, GFS, NFS, Ceph, etc. Furthermore, the storage path used must be accessible from all participating processes/nodes (i.e. all Task Managers and Job Managers).

Otherwise, as you've pointed out, the checkpoint data would be lost if a local disk failed.

The Job Manager has complete knowledge concerning checkpointing, and if you have HA configured, this information is stored in the configured HA storage provider in order to enable Job Manager failover.