I have been following this tutorial, How to configure a production ready Mesos cluster, and have been creating an Ansible playbook along the way, which you can see here: mesos ansible playbook.
Ansible runs successfully and I can visit port 5050 on a master and see the Mesos dashboard. However, there seem to be 3 problems, which are hopefully all connected but seem separate at face value.
Any ideas of what I have done wrong or if anything has changed since this tutorial was published?
Edit: I tried to dig in deeper. After running Ansible I logged into each node and restarted the Mesos and Marathon services manually. This appeared to do the trick, as I got to the Marathon dashboard, and after a bit of fiddling on the slaves I could see those were activated as well. Unfortunately I was not able to reproduce this after nuking the nodes and rebuilding. My settings are consistent with the tutorial I linked and the tutorial linked by Celine, so I think the problem is the order in which I am doing my service restarts. Still looking for any help.
Edit2: Copy of the logs from one of the masters on startup. The last HTTP call just repeats and repeats:
I1014 18:56:32.746968 11494 logging.cpp:172] INFO level logging started!
I1014 18:56:32.748177 11494 main.cpp:229] Build: 2015-10-12 20:57:28 by root
I1014 18:56:32.748277 11494 main.cpp:231] Version: 0.25.0
I1014 18:56:32.748345 11494 main.cpp:234] Git tag: 0.25.0
I1014 18:56:32.748406 11494 main.cpp:238] Git SHA: 2dd7f7ee115fe00b8e098b0a10762a4fa8f4600f
I1014 18:56:32.748615 11494 main.cpp:252] Using 'HierarchicalDRF' allocator
I1014 18:56:32.759768 11494 leveldb.cpp:176] Opened db in 10.929155ms
I1014 18:56:32.763638 11494 leveldb.cpp:183] Compacted db in 3.722708ms
I1014 18:56:32.763713 11494 leveldb.cpp:198] Created db iterator in 33931ns
I1014 18:56:32.763761 11494 leveldb.cpp:204] Seeked to beginning of db in 8624ns
I1014 18:56:32.764142 11494 leveldb.cpp:273] Iterated through 1 keys in the db in 352415ns
I1014 18:56:32.764263 11494 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I1014 18:56:32.767266 11520 log.cpp:238] Attempting to join replica to ZooKeeper group
I1014 18:56:32.767493 11520 recover.cpp:449] Starting replica recovery
I1014 18:56:32.767623 11520 recover.cpp:475] Replica is in VOTING status
I1014 18:56:32.767695 11520 recover.cpp:464] Recover process terminated
I1014 18:56:32.775274 11494 main.cpp:465] Starting Mesos master
I1014 18:56:32.779567 11516 master.cpp:376] Master 75abeaaa-a949-45a3-bd85-bebf100eecad (159.203.107.10) started on 159.203.107.10:5050
I1014 18:56:32.779597 11516 master.cpp:378] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname="159.203.107.10" --hostname_lookup="true" --initialize_driver_logging="true" --ip="159.203.107.10" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="1" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://159.203.107.10:2181,159.203.107.151:2181,159.203.107.162:2181/mesos" --zk_session_timeout="10secs"
I1014 18:56:32.779762 11516 master.cpp:425] Master allowing unauthenticated frameworks to register
I1014 18:56:32.779770 11516 master.cpp:430] Master allowing unauthenticated slaves to register
I1014 18:56:32.779778 11516 master.cpp:467] Using default 'crammd5' authenticator
W1014 18:56:32.779798 11516 authenticator.cpp:505] No credentials provided, authentication requests will be refused
I1014 18:56:32.779906 11516 authenticator.cpp:512] Initializing server SASL
I1014 18:56:32.791836 11515 master.cpp:1542] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I1014 18:56:32.792043 11519 contender.cpp:149] Joining the ZK group
I1014 18:56:34.968217 11517 http.cpp:336] HTTP GET for /master/state.json from 12.228.115.34:40863 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'
I1014 18:56:45.242039 11518 http.cpp:336] HTTP GET for /master/state.json from 12.228.115.34:63018 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'
I1014 18:56:55.319259 11519 http.cpp:336] HTTP GET for /master/state.json from 12.228.115.34:50024 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 1
Thanks
This was a ZooKeeper config problem. None of the tutorials mention needing to set values in zoo.cfg besides listing the server IPs. You also need to set dataDir, syncLimit, initLimit, tickTime and clientPort.
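
To make that concrete, here is a rough Python sketch that checks a zoo.cfg for those five settings. The example file contents are assumptions for illustration: the values mirror the ones commonly shipped in ZooKeeper's sample config, the dataDir path is a guess, and the server IPs are simply the three hosts from the --zk flag in the log above with the conventional 2888:3888 ports. The helper names (parse_zoo_cfg, missing_keys) are made up for this sketch, so adjust everything to your own cluster.

```python
# Settings the answer says must be present in zoo.cfg, besides the server list.
REQUIRED_KEYS = ("tickTime", "initLimit", "syncLimit", "dataDir", "clientPort")

# Illustrative zoo.cfg contents. All values here are assumptions, not the
# original poster's actual file.
EXAMPLE_ZOO_CFG = """\
# timing settings (milliseconds per tick, then limits counted in ticks)
tickTime=2000
initLimit=10
syncLimit=5
# where ZooKeeper keeps its snapshots and the myid file (assumed path)
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=159.203.107.10:2888:3888
server.2=159.203.107.151:2888:3888
server.3=159.203.107.162:2888:3888
"""


def parse_zoo_cfg(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_keys(text):
    """Return the required settings that are absent from the config text."""
    settings = parse_zoo_cfg(text)
    return [key for key in REQUIRED_KEYS if key not in settings]


if __name__ == "__main__":
    print("missing settings:", missing_keys(EXAMPLE_ZOO_CFG))
```

Running something like this against the file your playbook templates out (read it with open(path).read() and pass the text to missing_keys) should flag a zoo.cfg that only lists the servers.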