How to speed up the restoration process from a 400MB Flink savepoint in S3 storage?

I am running a Flink job in a Kubernetes cluster using Flink Kubernetes Operator, with a parallelism of 3. Each operator in the job maintains an average state size of around 140MB. However, I have encountered significant delays when attempting to restore a 400MB savepoint from S3 during job restarts.

Given the setup of Flink in a Kubernetes environment with S3 storage, I am seeking advice on how to efficiently and expeditiously restore the state from the savepoint.

Specifically, I would like to address the following points:

Strategies and optimizations to speed up the restoration process for a Flink job with multiple parallelism and operator state of approximately 140MB, running in a Kubernetes cluster. Recommended configurations or adjustments within the Flink Kubernetes Operator or Flink configuration that may improve savepoint restoration performance from S3. Any best practices or Kubernetes-specific considerations that could help reduce the time it takes to restore the state and streamline job restarts. If you have experience or insights in running Flink in a Kubernetes environment and dealing with savepoint restoration from S3, I would greatly appreciate your valuable advice. Thank you!

Solution

Depending on why you are taking a savepoint, you may be able to speed up the restart process by using a more efficient type of state snapshot. This table in the Flink docs lays out the 4 types of snapshots against the different operational tasks you might be undertaking, and explains which snapshot types support those different actions.

Canonical savepoints are the most flexible format, but are more expensive to restore (assuming you are using the RocksDB state backend). This is because the RocksDB SST files have to be rebuilt from the data stored in the savepoint. If you can use a native savepoint, it should be faster, possibly much faster.

Another way to speed up restarts is with Task Local Recovery. I haven't tried to get this working on Kubernetes, but see the docs on Enabling Local Recovery Across Pod Restarts.

Disclaimer: I'm not familiar with how the Flink Kubernetes Operator implements these operations, or how to configure it. It may or may not support what I've suggested.