apache-flink

Why does stop-cluster.sh stop the most recently started Flink cluster?


I plan to upgrade from Flink 1.5.2 to 1.6.0 and then migrate the jobs. To minimize job downtime, I plan to run both Flink clusters at the same time and stop the old one once the jobs have been migrated successfully. However, when I tried to stop the old cluster by running stop-cluster.sh in the directory Flink1.5.2/bin, the cluster that stopped was Flink 1.6.0 instead of the expected Flink 1.5.2.

I ran some tests and found that stop-cluster.sh simply stops the most recently started Flink cluster. That is, if you start cluster 1.6.0 first and then start Flink 1.5.2, running stop-cluster.sh stops Flink 1.5.2 first, even if you run it from the 1.6.0 directory Flink1.6.0/bin. I expected stop-cluster.sh to stop cluster 1.6.0 when run from Flink1.6.0/bin and cluster 1.5.2 when run from Flink1.5.2/bin, but it did not.

I did some research and found that stop-cluster.sh kills the process recorded in a file that contains the pid. However, I don't know where that file is located, and I suspect that both clusters write their pid to the same place when they start, which confuses stop-cluster.sh.

Please advise how to stop a specific cluster.


Solution

  • By default, the pid file is written to /tmp and is named flink-<USER>-<FLINK_COMPONENT>.pid, so two clusters started by the same user on the same machine share the same pid files. You can change the directory by setting env.pid.dir in flink-conf.yaml. Giving each cluster its own pid directory keeps them separate, so each stop-cluster.sh only stops its own cluster.
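
As a rough sketch of the setup described above, the following shell helper writes an env.pid.dir entry into one installation's conf/flink-conf.yaml. The function name set_pid_dir and the example paths are hypothetical and not part of Flink; only the env.pid.dir key and the flink-conf.yaml file come from the answer. It assumes the flat "key: value" format of flink-conf.yaml, and the setting presumably needs to be in place before the cluster is started, so that the start and stop scripts look in the same directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Flink): point one Flink installation
# at its own pid directory by setting env.pid.dir in its flink-conf.yaml.
# Usage: set_pid_dir <flink_home> <pid_dir>
set_pid_dir() {
  local flink_home="$1" pid_dir="$2"
  local conf="$flink_home/conf/flink-conf.yaml"

  # Refuse to guess if this does not look like a Flink installation.
  if [ ! -f "$conf" ]; then
    echo "flink-conf.yaml not found under $flink_home/conf" >&2
    return 1
  fi

  mkdir -p "$pid_dir"

  # Drop any existing env.pid.dir entry, then append the new one.
  # grep -v exits non-zero when it prints nothing, hence the "|| true".
  grep -v '^env\.pid\.dir:' "$conf" > "$conf.tmp" || true
  mv "$conf.tmp" "$conf"
  printf 'env.pid.dir: %s\n' "$pid_dir" >> "$conf"
}

# Example calls, with assumed installation paths (adjust to your layout):
#   set_pid_dir /opt/Flink1.5.2 /var/run/flink-1.5.2
#   set_pid_dir /opt/Flink1.6.0 /var/run/flink-1.6.0
```

With a separate directory per installation, running Flink1.5.2/bin/stop-cluster.sh should read only the pid files written by the 1.5.2 cluster, and likewise for 1.6.0.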