Tags: docker, kubernetes, deployment, apache-flink

How to handle Flink management and K8s management


I'm considering deploying Flink with K8s. I'm a Flink newbie and have a simple question:

Say I use K8s to manage containers and deploy the TaskManagers into those containers.

As I understand it, a container can be restarted by K8s when it fails, and a task can be restarted by Flink when it fails.

If a task is running in a container and the container suddenly fails for some reason, then from Flink's point of view a task has failed and should be restarted, while from K8s' point of view a container has failed and should be restarted. In this case, should we worry about a conflict between the two kinds of restart?


Solution

  • I think you want to read up on the official kubernetes setup guide here: https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/kubernetes.html

    It describes 3 ways of getting it to work:

    1. Session Cluster: This involves spinning up the two Deployments from the appendix of that guide and requires you to submit your Flink job manually, or via a script, once the cluster is up. It is very similar to the local standalone cluster you use while developing, except it now runs in your Kubernetes cluster.

    2. Job Cluster: By deploying Flink as a k8s job, you would be able to eliminate the job submission step.

    3. Helm chart: By the look of it, the project has not updated this for 2 years, so your mileage may vary.

    I have had success with a Session Cluster, but I would eventually like to try the "proper" way, which by the looks of it is to deploy it as a Kubernetes job using the second method.

    Depending on your Flink source and the kind of failure, your Flink job will fail in different ways, but you shouldn't worry about a "conflict". Either Kubernetes restarts the container, or Flink handles the errors it is able to handle. After a certain number of retries the job is cancelled, depending on how you configured the restart strategy; see the Configuration documentation for more details (a minimal sketch of the programmatic configuration is shown below).

    If the container exits with a non-zero code, Kubernetes will try to restart it. However, it may or may not resubmit the job, depending on whether you deployed a Job Cluster and whether the image you used has an initialization script that submits the job. In a Session Cluster this can be problematic depending on whether the job submission goes through the task manager or the job manager: if the job was submitted through the task manager, you need to cancel the existing failed job so the resubmitted job can start.
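
    As a rough illustration of that restart configuration (a sketch only; the attempt count and delay are placeholder values you would tune for your job), a fixed-delay restart strategy can be set programmatically in the DataStream API:

    ```java
    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RestartStrategyExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Retry a failed job a fixed number of times before giving up entirely;
            // 3 attempts and a 10-second delay are placeholder values.
            env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                    3,                   // restart attempts before the job is marked FAILED
                    Time.seconds(10)));  // delay between attempts

            // Trivial pipeline so the example actually runs; replace with your job.
            env.fromElements(1, 2, 3).print();
            env.execute("restart-strategy-example");
        }
    }
    ```

    The same behaviour can be configured cluster-wide in flink-conf.yaml via restart-strategy: fixed-delay together with restart-strategy.fixed-delay.attempts and restart-strategy.fixed-delay.delay.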

    Note: if you go with the Session Cluster and use a filesystem-based state backend (i.e. not RocksDB) for checkpoints, you will need to figure out a way for the job manager and task managers to share a checkpoint directory.

    If the task manager uses a checkpoint directory that the job manager cannot access, the task manager's persistence layer will build up and eventually cause some kind of out-of-disk-space error. This may not be a problem if you go with RocksDB and enable incremental checkpoints.
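
    To make the shared-checkpoint-directory point concrete, here is a minimal sketch. The s3://my-bucket/flink/checkpoints path is a made-up example and stands for whatever storage (S3, HDFS, a mounted shared volume, ...) both the job manager and every task manager pod can reach; an s3:// URI also assumes the matching filesystem plugin is available on the cluster.

    ```java
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointBackendExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000); // take a checkpoint every 60 seconds

            // Filesystem state backend: checkpoints go to a directory that must be
            // reachable by the job manager AND all task manager pods. A path that only
            // exists inside a single container defeats the purpose.
            env.setStateBackend(new FsStateBackend("s3://my-bucket/flink/checkpoints"));

            // Alternative: RocksDB with incremental checkpoints enabled, e.g.
            //   new org.apache.flink.contrib.streaming.state.RocksDBStateBackend(
            //           "s3://my-bucket/flink/checkpoints", true)
            // (requires the flink-statebackend-rocksdb dependency).

            // Trivial pipeline so the example actually runs; replace with your job.
            env.fromElements(1, 2, 3).print();
            env.execute("checkpoint-backend-example");
        }
    }
    ```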