apache-zookeeper coreos mesosphere dcos

Flapping metrics in DC/OS dashboard after changing master nodes


After we moved two of the three master nodes in a DC/OS 1.8 cluster to a newer CoreOS version (one with a kernel that is patched against the Dirty COW vulnerability), the masters stopped working. The dashboard showed an empty data center.

We synchronized /var/lib/dcos from the old master to the two new master nodes, after which the dashboard started working again. However, the DC/OS dashboard still shows flapping metrics. We have a mesos.leader and a ZooKeeper leader.
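One way to double-check the ZooKeeper side of that statement is to ask every master for its role and confirm that exactly one reports itself as leader. The following is only a minimal sketch, assuming ZooKeeper listens on port 2181 on each master and that the four-letter-word command `srvr` is enabled (it is by default on ZooKeeper 3.4.x; newer versions need it whitelisted). The host names are hypothetical placeholders.

```python
import socket


def four_letter_word(host, port=2181, cmd=b"srvr", timeout=5.0):
    """Send a ZooKeeper four-letter-word command and return the reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(srvr_output):
    """Extract the value of the 'Mode:' line (leader, follower, standalone)."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None  # no Mode line, e.g. the node is not serving requests


def summarize_modes(outputs):
    """Map each host to its mode and list the hosts that claim leadership."""
    modes = {host: parse_mode(text) for host, text in outputs.items()}
    leaders = [host for host, mode in modes.items() if mode == "leader"]
    return modes, leaders


if __name__ == "__main__":
    # Hypothetical master addresses; replace with the real ones.
    masters = ["master1.example", "master2.example", "master3.example"]
    outputs = {}
    for host in masters:
        try:
            outputs[host] = four_letter_word(host)
        except OSError as exc:
            outputs[host] = ""  # unreachable nodes end up with mode None
            print(f"{host}: connection failed ({exc})")
    modes, leaders = summarize_modes(outputs)
    print(modes)
    print("leaders:", leaders)
```

A healthy three-node ensemble should show one leader and two followers. Running the check several times a few seconds apart can show whether the roles themselves are changing while the dashboard flaps.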

How can we stabilize the cluster?


Solution

  • The last time this happened to us, we had to reinstall the cluster. I just finished stopping our master nodes one at a time to increase the disk size, and we are now back in the flapping state. I think a reinstall is in our future, but I'm searching for answers to help avoid that.