mesos marathon

Migrate Marathon apps for mesos-slave graceful shutdown


I have a small Mesos cluster and I'm using Marathon to manage a set of long-running services with a variable number of instances each.

I'd like to be able to launch new nodes or terminate existing ones as business needs require. However, terminating a node exposes a potential problem: when I shut down a Mesos slave, the number of running instances of some services can temporarily fall below the defined minimumHealthCapacity. That can cause downtime if, for example, the machine being stopped runs a service that has only one instance.

Consider the following simplified scenario: node 1 is running service A, node 2 is running service B and node 3 is running service C. The minimumHealthCapacity for all services is 1. I want to terminate node 1 and leave only nodes 2 and 3 running, without any downtime on service A. An example of the intended behavior would be to scale service A to 2 instances and then safely terminate node 1.

What can I do to make sure no service falls below the minimumHealthCapacity?

Ideally, I would have a process inspired by rolling updates: replacements are launched on other machines, and only then are the service instances on the machine to be shut down terminated. At a minimum I would like this to be automated, so that scaling down is a simple script away. I have no requirement on how long this takes, i.e. I can shut down the Mesos slave only after I'm sure the Marathon migration has finished successfully.
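
To make the intended process concrete, here is a minimal sketch of the planning step only. It assumes the task list has the shape of the entries returned by Marathon's GET /v2/tasks (dicts with "appId" and "host"); the function name plan_drain and the layout of the returned plan are hypothetical, and nothing here talks to Marathon.

```python
from collections import Counter


def plan_drain(tasks, instances, draining_host):
    """Plan how to move tasks off draining_host without losing capacity.

    tasks:      list of dicts with "appId" and "host" keys (assumed to
                mirror the entries of Marathon's GET /v2/tasks response).
    instances:  dict mapping appId -> currently configured instance count.
    Returns one step per affected app, sorted by appId.
    """
    # Count how many tasks of each app run on the host being drained.
    on_host = Counter(t["appId"] for t in tasks if t["host"] == draining_host)

    plan = []
    for app_id, count in sorted(on_host.items()):
        plan.append({
            "appId": app_id,
            # Launch replacements first, so capacity never drops.
            "scale_up_to": instances[app_id] + count,
            # Then kill this many tasks on the draining host.
            "kill_on_host": count,
            # After the kill, the app is back at its original size.
            "final_instances": instances[app_id],
        })
    return plan


if __name__ == "__main__":
    # The scenario from the question: one service per node, one instance each.
    tasks = [
        {"appId": "/service-a", "host": "node1"},
        {"appId": "/service-b", "host": "node2"},
        {"appId": "/service-c", "host": "node3"},
    ]
    instances = {"/service-a": 1, "/service-b": 1, "/service-c": 1}
    for step in plan_drain(tasks, instances, "node1"):
        print(step)
```

For the scenario above, the plan contains a single step: scale service A up to 2, then kill 1 task on node 1. Executing such a plan would presumably map to a PUT on /v2/apps/{appId} with the new instance count, waiting until the replacements are healthy, and then DELETE /v2/apps/{appId}/tasks?host={host}&scale=true. The replacements would also need to be kept off the draining node, for example with a ["hostname", "UNLIKE", ...] constraint. None of that is tested here.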


Solution

  • The Mesos dev team is currently working on "Maintenance Primitives" so that an operator can indicate that a particular machine is scheduled to go down at a certain time (or ASAP), triggering messages to each framework notifying them of the intended unavailability window. A framework like Marathon could then decide to migrate its tasks away from that node so that it can be safely terminated without any service downtime.

    See https://issues.apache.org/jira/browse/MESOS-1474 for more details/patches.
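
As a rough illustration of what announcing an unavailability window could look like, the sketch below builds a schedule payload. The answer above only says the feature is in progress; the JSON shape used here is an assumption based on the format documented for the maintenance endpoints in later Mesos releases (a POST to the master's /maintenance/schedule). The helper name, hostname and IP address are hypothetical.

```python
import json

NANOS_PER_SECOND = 1_000_000_000


def maintenance_schedule(machines, start_epoch_s, duration_s):
    """Build a single-window maintenance schedule.

    machines:       list of (hostname, ip) pairs for the nodes going down.
    start_epoch_s:  window start as Unix time in seconds.
    duration_s:     window length in seconds.
    The field names are assumed from the Mesos maintenance documentation.
    """
    return {
        "windows": [{
            "machine_ids": [
                {"hostname": host, "ip": ip} for host, ip in machines
            ],
            "unavailability": {
                # Both values are expressed in nanoseconds.
                "start": {"nanoseconds": start_epoch_s * NANOS_PER_SECOND},
                "duration": {"nanoseconds": duration_s * NANOS_PER_SECOND},
            },
        }]
    }


if __name__ == "__main__":
    # Hypothetical node: take "node1" down for one hour.
    schedule = maintenance_schedule([("node1", "10.0.0.1")], 1443830400, 3600)
    print(json.dumps(schedule, indent=2))
```

Whether a framework such as Marathon reacts to the resulting notifications by migrating its tasks depends on the framework; the payload only announces the window.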