I am setting up a Mesos cluster. Our setup is:
3 primary boxes (8 GB RAM, 4 CPUs) and 3 worker boxes (1 GB RAM, 1 CPU).
As far as I can see, the configuration files match across the machines and are correct. In /etc/mesos/zk
I have:
zk://106.133.117.128:2181,zk://153.213.95.171:2181,zk://106.121.34.29:2181/mesos
(I have changed the IP addresses from the real ones, but I use these same substitutes wherever they are referenced below.)
I am not quite sure where to go from here. I have stepped through each piece of configuration.
The IDs are located at /etc/zookeeper/conf/myid
on each machine and are set correctly. The ZooKeeper config on each machine also lists the matching IP and ID for every server.
My quorum size is 2.
The ip and hostname settings on each machine are both set to that machine's IP.
The configuration for Marathon in /etc/marathon/conf/master
reads:
zk://106.133.117.128:2181,zk://153.213.95.171:2181,zk://106.121.34.29:2181/marathon
The exact error from the logs is:
Log file created at: 2015/10/01 13:56:32
Running on machine: mesos-primary-1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1001 13:56:32.595760 6618 logging.cpp:172] INFO level logging started!
I1001 13:56:32.596060 6618 main.cpp:229] Build: 2015-09-25 19:13:24 by root
I1001 13:56:32.596082 6618 main.cpp:231] Version: 0.24.1
I1001 13:56:32.596094 6618 main.cpp:234] Git tag: 0.24.1
I1001 13:56:32.596106 6618 main.cpp:238] Git SHA: 44873806c2bb55da37e9adbece938274d8cd7c48
I1001 13:56:32.596161 6618 main.cpp:252] Using 'HierarchicalDRF' allocator
I1001 13:56:32.602738 6618 leveldb.cpp:176] Opened db in 6.456045ms
I1001 13:56:32.611217 6618 leveldb.cpp:183] Compacted db in 8.423531ms
I1001 13:56:32.611312 6618 leveldb.cpp:198] Created db iterator in 22068ns
I1001 13:56:32.611348 6618 leveldb.cpp:204] Seeked to beginning of db in 1287ns
I1001 13:56:32.611372 6618 leveldb.cpp:273] Iterated through 0 keys in the db in 376ns
I1001 13:56:32.611448 6618 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I1001 13:56:32.647243 6648 log.cpp:238] Attempting to join replica to ZooKeeper group
I1001 13:56:32.689388 6651 recover.cpp:449] Starting replica recovery
I1001 13:56:32.690028 6651 recover.cpp:475] Replica is in EMPTY status
W1001 13:56:32.690147 6649 zookeeper.cpp:101] zookeeper_init failed: Invalid argument ; retrying in 1 second
W1001 13:56:32.690726 6644 zookeeper.cpp:101] zookeeper_init failed: Invalid argument ; retrying in 1 second
W1001 13:56:32.690768 6647 zookeeper.cpp:101] zookeeper_init failed: Invalid argument ; retrying in 1 second
W1001 13:56:32.690821 6645 zookeeper.cpp:101] zookeeper_init failed: Invalid argument ; retrying in 1 second
I1001 13:56:32.690891 6618 main.cpp:465] Starting Mesos master
I1001 13:56:32.691463 6618 master.cpp:378] Master 20151001-135632-2088076136-5050-6618 (104.131.117.124) started on 106.133.117.128:5050
I1001 13:56:32.691494 6618 master.cpp:380] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname="106.133.117.128" --initialize_driver_logging="true" --ip="106.133.117.128" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="2" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://106.133.117.128:2181,zk://153.213.95.171:2181,zk://106.121.34.29:2181/mesoss" --zk_session_timeout="10secs"
I1001 13:56:32.691671 6618 master.cpp:427] Master allowing unauthenticated frameworks to register
I1001 13:56:32.691700 6618 master.cpp:432] Master allowing unauthenticated slaves to register
I1001 13:56:32.691725 6618 master.cpp:469] Using default 'crammd5' authenticator
W1001 13:56:32.691756 6618 authenticator.cpp:505] No credentials provided, authentication requests will be refused.
I1001 13:56:32.691790 6618 authenticator.cpp:512] Initializing server SASL
I1001 13:56:32.695333 6646 master.cpp:1464] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I1001 13:56:32.695377 6646 contender.cpp:149] Joining the ZK group
W1001 13:56:33.690989 6649 zookeeper.cpp:101] zookeeper_init failed: Invalid argument ; retrying in 1 second
W1001 13:56:33.691220 6644 zookeeper.cpp:101] zookeeper_init failed: Invalid argument ; retrying in 1 second
Any input is much appreciated.
Your Mesos ZooKeeper string is malformed: the zk:// prefix should appear only once, at the start, not before every host. It should be of the form
zk://host1:port1,host2:port2,host3:port3/path
That should fix your issue (barring any other configuration problems).
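To make the change concrete, here is a minimal Python sketch that rewrites a string with repeated zk:// prefixes into the single-prefix form. The helper name normalize_zk_url is hypothetical (it is not part of Mesos or ZooKeeper), and the input is the /etc/mesos/zk value from the question:

```python
def normalize_zk_url(url: str) -> str:
    """Collapse repeated 'zk://' prefixes into the form
    zk://host1:port1,host2:port2,.../path (hypothetical helper)."""
    scheme = "zk://"
    if not url.startswith(scheme):
        raise ValueError("expected a string starting with zk://")

    # Drop every 'zk://' so only the host list and the znode path remain.
    rest = url.replace(scheme, "")

    # The first '/' now separates the host list from the znode path.
    hosts, sep, path = rest.partition("/")
    host_list = [h.strip() for h in hosts.split(",") if h.strip()]
    if not host_list:
        raise ValueError("no ZooKeeper hosts found")

    # Re-attach a single scheme prefix in front of the comma-separated hosts.
    return scheme + ",".join(host_list) + sep + path


if __name__ == "__main__":
    broken = ("zk://106.133.117.128:2181,zk://153.213.95.171:2181,"
              "zk://106.121.34.29:2181/mesos")
    print(normalize_zk_url(broken))
```

A string that is already in the correct form passes through unchanged. The Marathon string in the question repeats the prefix in the same way, so it presumably needs the same correction.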
Answered on #mesos on freenode as well.