Mesosphere Marathon restart task on all instances

I am running a Mesos cluster on 3 instances, each running both mesos-master and mesos-slave. I believe the cluster is configured correctly, and it is able to run a web app via Docker and Marathon on all three instances.

I set up Jenkins to deploy to the cluster and, as the last step, POST to the Marathon REST API to restart the job. However, the restart fails silently (it simply gets stuck at the deployment stage). If the app is running on only 2 instances, the restart goes smoothly. Does Marathon require one instance to be unoccupied to perform an app restart?

Am I missing something here?

Solution

Are there enough free resources in your cluster? IIRC, the default restart behavior first starts the new version and then scales down the old version (hence you need 2x the app's resources). See the Marathon Deployments documentation for details, in particular its upgrade strategy section.

Here is the relevant excerpt from the upgrade strategy section:

upgradeStrategy

During an upgrade all instances of an application get replaced by a new version. The upgradeStrategy controls how Marathon stops old versions and launches new versions. It consists of two values:

minimumHealthCapacity (Optional. Default: 1.0) - a number between 0 and 1 that is multiplied with the instance count. This is the minimum number of healthy nodes that do not sacrifice overall application purpose. Marathon will make sure, during the upgrade process, that at any point of time this number of healthy instances are up.

maximumOverCapacity (Optional. Default: 1.0) - a number between 0 and 1 which is multiplied with the instance count. This is the maximum number of additional instances launched at any point of time during the upgrade process.
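
To illustrate where these two values sit, here is a minimal sketch that adds an `upgradeStrategy` object to an app definition represented as a Python dict. The helper `with_upgrade_strategy`, the app id `/my-web-app`, and the chosen values are hypothetical. Only the names `upgradeStrategy`, `minimumHealthCapacity`, and `maximumOverCapacity` and the 0 to 1 range come from the excerpt.

```python
import json


def with_upgrade_strategy(app, minimum_health_capacity, maximum_over_capacity):
    """Return a copy of an app definition dict with an upgradeStrategy set.

    Hypothetical helper: it only builds the JSON structure and does not
    talk to Marathon.
    """
    for name, value in (
        ("minimumHealthCapacity", minimum_health_capacity),
        ("maximumOverCapacity", maximum_over_capacity),
    ):
        # The excerpt describes both values as numbers between 0 and 1.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")

    updated = dict(app)  # shallow copy so the caller's dict is left untouched
    updated["upgradeStrategy"] = {
        "minimumHealthCapacity": minimum_health_capacity,
        "maximumOverCapacity": maximum_over_capacity,
    }
    return updated


# Hypothetical app: 3 instances on a 3-node cluster with no spare capacity.
app = {"id": "/my-web-app", "instances": 3}
payload = with_upgrade_strategy(app, 0.5, 0.0)
print(json.dumps(payload, indent=2))
```

The printed JSON is the kind of fragment that would be merged into the app definition submitted to Marathon. The values 0.5 and 0.0 are just one possible choice for a cluster that cannot hold extra instances during a restart.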

The default minimumHealthCapacity is 1, which means no old instance can be stopped before another healthy new version is deployed. A value of 0.5 means that during an upgrade half of the old version instances are stopped first to make space for the new version. A value of 0 means take all instances down immediately and replace with the new application.

The default maximumOverCapacity is 1, which means that all old and new instances can co-exist during the upgrade process. A value of 0.1 means that during the upgrade process 10% more capacity than usual may be used for old and new instances. A value of 0.0 means that even during the upgrade process no more capacity may be used for the new instances than usual. Only when an old version is stopped, a new instance can be deployed.
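
The arithmetic behind these two settings can be sketched as follows. This is a rough illustration, not Marathon's actual code: the function name `upgrade_bounds` is hypothetical, and the rounding (minimum rounded up, extra capacity rounded down) is an assumption, since the excerpt only says the values are multiplied with the instance count.

```python
import math


def upgrade_bounds(instances, minimum_health_capacity=1.0, maximum_over_capacity=1.0):
    """Return (min_healthy, max_total) instance counts during an upgrade.

    Assumption: the healthy minimum is rounded up and the extra capacity
    is rounded down; Marathon's exact rounding may differ.
    """
    min_healthy = math.ceil(instances * minimum_health_capacity)
    max_extra = math.floor(instances * maximum_over_capacity)
    return min_healthy, instances + max_extra


# Compare the defaults with two more permissive settings for a 3-instance app.
for mhc, moc in [(1.0, 1.0), (0.5, 0.0), (0.0, 0.0)]:
    min_healthy, max_total = upgrade_bounds(3, mhc, moc)
    print(
        f"minimumHealthCapacity={mhc}, maximumOverCapacity={moc}: "
        f"at least {min_healthy} healthy, at most {max_total} running"
    )
```

With the defaults and 3 instances, this gives a minimum of 3 healthy instances and a maximum of 6 running, so no old instance can be stopped until a new one is healthy, and that new one needs free resources somewhere in the cluster. If every node is already occupied, a lower minimumHealthCapacity (for example 0.5, which gives a minimum of 2 here) lets Marathon stop an old instance first to make room.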