Tags: kubernetes, apache-flink, high-availability

Multiple "k8s-ha-app1-jobmanager" configmaps on every Flink job run


I have a Flink session cluster on top of Kubernetes, and recently I switched from ZooKeeper-based HA to Kubernetes HA.
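For reference, Kubernetes HA is enabled through flink-conf.yaml entries along the following lines; the cluster-id and storage path shown here are illustrative (the cluster-id is guessed from the ConfigMap names below), not my exact values:

high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
kubernetes.cluster-id: k8s-ha-app1
high-availability.storageDir: s3://flink-ha/recovery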

Reading through
https://cwiki.apache.org/confluence/display/FLINK/FLIP-144%3A+Native+Kubernetes+HA+for+Flink#FLIP144:NativeKubernetesHAforFlink-LeaderElection

I can observe in the Flink namespace the ConfigMaps for each resource, as described there:

k8s-ha-app1-00000000000000000000000000000000-jobmanager   2      4m35s  
k8s-ha-app1-dispatcher                                    2      4m38s  
k8s-ha-app1-resourcemanager                               2      4m38s  
k8s-ha-app1-restserver                                    2      4m38s
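
(A listing like the one above can be produced with something along these lines; the namespace is a placeholder:)

kubectl -n <flink-namespace> get configmaps | grep k8s-ha-app1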

However, I don't see a single ConfigMap for the "jobmanager" resource; I see as many as there are jobs run across the day. This can be a high number, so over days it implies a huge surge of ConfigMaps in the cluster namespace.

The different HA ConfigMaps for the jobmanager seem to differ both in the

"address": "akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_XXX"

(where XXX keeps increasing)
and in the "sessionId" value.

Can someone please explain to me on what basis these "jobmanager" resources are created? At first I thought there might be a scheduled cleanup round, but I read in the docs that the HA ConfigMaps are only stripped of their owner reference, not deleted. Did I miss setting something so that all the jobs run against the same session, or is there some way I can get these k8s-ha-app1-XXXXXXXXXXXXXXXXXXXXX-jobmanager ConfigMaps cleaned up after the job runs?


Solution

  • The way Flink works internally is that the Dispatcher creates a dedicated JobMaster component for every submitted job. This component needs leader election, and for that purpose it creates a k8s-ha-app1-<JOB_ID>-jobmanager ConfigMap. This is why you see multiple xyz-jobmanager ConfigMaps being created.

    The reason why these ConfigMaps are not cleaned up is that this currently happens only when the whole cluster is shut down. This is a known limitation, and the Flink community has created FLINK-20695 to fix it. The idea is that the JobMaster-related ConfigMaps can be deleted once the job has reached a terminal state (a manual workaround is sketched below this answer).

    Somewhat related is another limitation that hampers proper cleanup in the case of a session cluster: if the cluster is shut down with a SIGTERM signal, it is currently not guaranteed that all resources are cleaned up. See FLINK-21008 for more information.
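
    Until FLINK-20695 is resolved, the leftover per-job ConfigMaps can only be removed manually. The following is a sketch, not a vetted procedure: the name pattern and namespace are assumptions based on the listing in the question, it should only be applied to jobs that have already reached a terminal state, and the cluster-wide dispatcher/resourcemanager/restserver ConfigMaps of the running session cluster must be left untouched.

    # Sketch only: delete the per-job "-jobmanager" leader ConfigMaps of jobs that
    # have already finished. The regex matches the 32-character job id in the name;
    # <flink-namespace> is a placeholder. Do not run this while the matched jobs are
    # still running, and keep the dispatcher/resourcemanager/restserver ConfigMaps.
    kubectl -n <flink-namespace> get configmaps -o name \
      | grep -E 'k8s-ha-app1-[0-9a-f]{32}-jobmanager' \
      | xargs -r kubectl -n <flink-namespace> delete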