apache-zookeeper, mesos, mesosphere, marathon

Mesos cluster fails to elect master when using replicated_log


  • Test environment: multi-node Mesos 0.27.2 cluster on AWS (3 masters, 2 slaves, quorum=2).
  • I tested ZooKeeper persistence with zkCli.sh and it works fine.
  • If I start the masters with --registry=in_memory, everything works: a master is elected and I can start tasks via Marathon.
  • If I use the default (--registry=replicated_log), the cluster fails to elect a master:

https://gist.github.com/mitel/67acd44408f4d51af192

EDIT: Apparently the problem was the firewall. I applied an allow-all rule to all my security groups and now I have a stable master. Once I figure out what was blocking the communication, I'll post it here.


Solution

  • I discovered that Mesos masters also initiate connections to other masters on port 5050. After adding the corresponding egress rule to the masters' security group, the cluster is stable and master election happens as expected.

    UPDATE: For those trying to build an internal firewall between the various components of Mesos, ZooKeeper, etc.: don't do it. It is better to design the security as in Mesosphere's DCOS.
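
If you suspect the same master-to-master connectivity problem, a quick TCP check run from each master can show whether port 5050 on the other masters is reachable. The following is only a minimal sketch: the peer addresses are hypothetical placeholders, and the helper names `can_connect` and `check_peers` are made up for illustration, not part of Mesos. A successful connection only shows that the port is open in that direction; it does not prove the replicated log itself is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_peers(peers, port=5050, timeout=3.0):
    """Map each peer host to whether a TCP connection on `port` succeeded."""
    return {host: can_connect(host, port, timeout) for host in peers}


if __name__ == "__main__":
    # Hypothetical private IPs of the other masters; replace with your own.
    peers = ["10.0.0.11", "10.0.0.12"]
    for host, ok in check_peers(peers).items():
        status = "reachable" if ok else "blocked or down"
        print(f"{host}:5050 {status}")
```

Run it on every master in turn. If a peer is reachable from one master but not from another, check both the ingress rules of the target and the egress rules of the source security group.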