apache-zookeeper mesos marathon mesosphere

What conditions cause a Marathon leader election?

I'm using Mesos and Marathon to manage application deployments, and have run into this Marathon bug: https://github.com/mesosphere/marathon/issues/3783 , in which a leader election during a deployment scales instances down to 0. Leader elections were happening very frequently (approximately once every 30 minutes), so I was hitting this issue fairly often.

I know once every 30 minutes is highly irregular, because I've since upgraded to Marathon 1.3.10 and have been election-free for the past 2 days, but how often is "normal"? Does leader abdication / election happen under normal conditions, or should I expect 0 elections unless there is an underlying issue? A colleague suggested that "leader elections are normal" and that a "certain number of elections are normal and to be expected". I just don't believe that, and would like to know for sure.

Solution

It is not normal for Marathon to re-elect a leader every 30 minutes. Under normal circumstances Marathon should not abdicate or elect a new leader unless maintenance occurs (an update or a restart). If it happens anyway, there are 4 main causes (the first three all come down to timeouts):

Marathon performance: when Marathon has a performance problem, one of the symptoms is losing leadership. This is because Marathon does not respond to Zookeeper within the given interval and is marked as gone.
Marathon-Zookeeper connection issues: when network delays are too high (e.g., the Zookeeper cluster is located in a different DC than Marathon), some updates can time out. This results in losing leadership.
Zookeeper performance: when Zookeeper has too much work to do, it will time out some requests, causing Marathon to lose leadership.
Forced abdication: Marathon was told to abdicate via DELETE /v2/leader.
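
To tell which of these causes applies, it helps to first record when leadership actually changes and then line those timestamps up with Marathon and Zookeeper logs. Below is a minimal sketch, assuming Marathon's `GET /v2/leader` endpoint returns JSON of the form `{"leader": "host:port"}`; the function names `fetch_leader` and `leader_changes` and the base URL are hypothetical, not part of Marathon.

```python
import json
import urllib.request
from typing import Iterable, List, Optional, Tuple


def fetch_leader(base_url: str, timeout: float = 5.0) -> Optional[str]:
    """Ask one Marathon instance who the leader is.

    Returns 'host:port', or None if the request fails or the body is
    not the expected JSON (assumption: {"leader": "host:port"}).
    """
    try:
        with urllib.request.urlopen(f"{base_url}/v2/leader", timeout=timeout) as resp:
            return json.load(resp).get("leader")
    except (OSError, ValueError):
        # URLError/HTTPError are OSError subclasses; bad JSON raises ValueError.
        return None


def leader_changes(
    samples: Iterable[Tuple[float, Optional[str]]]
) -> List[Tuple[float, str, str]]:
    """Turn (timestamp, leader) samples into (timestamp, old, new) changes.

    Samples where the leader is unknown (None) are skipped, so a failed
    poll is not counted as an election by itself.
    """
    changes: List[Tuple[float, str, str]] = []
    previous: Optional[str] = None
    for timestamp, leader in samples:
        if leader is None:
            continue
        if previous is not None and leader != previous:
            changes.append((timestamp, previous, leader))
        previous = leader
    return changes
```

In practice you would call `fetch_leader("http://marathon.example:8080")` (hypothetical host) on a fixed interval, store each result with `time.time()`, and pass the collected samples to `leader_changes`. On a healthy cluster the result should stay empty between maintenance windows.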

To fix performance problems, follow the steps below, described here:

Shard your Marathon.

Monitor — enable metrics but remember to configure them.

Update to 1.3.10 or later.

Minimize Zookeeper communication latency and object size.

Tune JVM — add more heap and CPUs :).

Do not use the event bus — if you really need to, use filtered SSE, and accept it is asynchronous and events are delivered at most once.

If you need task life cycle events, use a custom executor.

Prefer batch deployments to many individual ones.
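
As an illustration of the last point, here is a rough sketch of sending several app changes as one request instead of one request per app. It assumes that `PUT /v2/apps` accepts a JSON array of app definitions, which is my understanding of the Marathon REST API; check the API docs for your version. The helper name `build_batch_update`, the host, and the app ids are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_batch_update(base_url: str, apps: List[Dict[str, Any]]) -> urllib.request.Request:
    """Build a single PUT /v2/apps request carrying all app changes.

    Assumption: Marathon treats the JSON array as one batch, so the
    changes go through as one deployment rather than many.
    """
    body = json.dumps(apps).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v2/apps",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical app definitions: only the fields being changed.
apps = [
    {"id": "/web", "instances": 3},
    {"id": "/worker", "instances": 5},
]
request = build_batch_update("http://marathon.example:8080", apps)
# The request is only constructed here; send it with
# urllib.request.urlopen(request) when you actually want to deploy.
```

Fewer, larger deployments mean fewer writes to Zookeeper, which is the point of the advice above.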