Marathon has first-class support for performing rolling (zero-downtime) upgrades of your applications. But what if you need to upgrade or reconfigure Mesos itself?
More specifically I'd like to know if it's possible to upgrade/reconfigure Mesos Master and Slave instances without causing any downtime?
Reconfiguring slaves in a rolling fashion should be trivial, since you can run redundant slave instances.
Would it be safe to upgrade a slave to a later version than the master? In other words, is the master kept forwards compatible with respect to the slaves?
According to the operational guide, it looks like it would be possible to take down one master node at a time in High Availability mode: http://mesos.apache.org/documentation/latest/operational-guide/
I wonder whether masters running differing versions would be compatible with each other, however?
I suppose you could spin up a new Mesos cluster and migrate your existing workload across, but this seems like a pain.
Yes, you can upgrade Mesos with zero downtime for your tasks. Two consecutive releases are supposed to work together in all combinations of masters and slaves. The Upgrade Guide usually gives more details on how to upgrade between two specific releases.
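When upgrading, you don't even have to kill your running tasks, thanks to Slave Recovery: provided checkpointing is enabled, tasks keep running while the slave process is restarted, and the restarted slave reconnects to them.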
Btw, rolling upgrades were an early Twitter use case, so you can be relatively certain they will remain an important and supported feature.
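To make the order of operations concrete, here is a minimal sketch of how a rolling upgrade could be planned. It is not part of Mesos: `plan_rolling_upgrade`, the step names, and the host names are all hypothetical, and it only computes an ordering without touching any machine. It assumes masters are upgraded before slaves (the general order the Upgrade Guide describes; check the notes for your specific release pair) and that the current leader is restarted last to avoid extra failovers, which is a design choice rather than a requirement.

```python
from typing import List, Tuple

# A step is (action, hosts). The action names are illustrative only.
Step = Tuple[str, List[str]]


def plan_rolling_upgrade(
    masters: List[str],
    leader: str,
    slaves: List[str],
    batch_size: int = 1,
) -> List[Step]:
    """Return an ordered list of upgrade steps (hypothetical helper).

    Masters go one at a time, standbys first and the leader last, so
    that a quorum stays available. Slaves follow in batches of
    `batch_size`, so only part of the capacity restarts at once.
    """
    if leader not in masters:
        raise ValueError("leader must be one of the masters")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    steps: List[Step] = []

    # One master at a time; the leader is restarted last.
    standbys = [m for m in masters if m != leader]
    for master in standbys + [leader]:
        steps.append(("upgrade-master", [master]))

    # Slaves in fixed-size batches; the final batch may be smaller.
    for i in range(0, len(slaves), batch_size):
        steps.append(("upgrade-slaves", slaves[i:i + batch_size]))

    return steps


if __name__ == "__main__":
    plan = plan_rolling_upgrade(
        masters=["m1", "m2", "m3"],
        leader="m2",
        slaves=["s1", "s2", "s3"],
        batch_size=2,
    )
    for action, hosts in plan:
        print(action, ", ".join(hosts))
```

In practice each step would wrap whatever you use to install the new binaries and restart the process, and you would wait for the node to rejoin the cluster before moving on to the next step.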