
Apache Mesos slave cannot connect to master


I've been trying to set up Apache Mesos on two machines: one runs only a slave, the other runs a master and a slave. I'm using the Mesosphere packages for this.

The slave on the master machine (james-pc) connects fine, but the slave on the other machine doesn't seem to connect. Log messages below.

These are samples, so the timestamps may not match. I've run the same commands and read the log files many times :(

Thanks!!

Slave (riri-desktop)

I1015 13:44:40.098458 16485 main.cpp:126] Build: 2014-09-23 05:36:09 by root
I1015 13:44:40.098520 16485 main.cpp:128] Version: 0.20.1
I1015 13:44:40.098530 16485 main.cpp:131] Git tag: 0.20.1
I1015 13:44:40.098537 16485 main.cpp:135] Git SHA: fe0a39112f3304283f970f1b08b322b1e970829d
I1015 13:44:40.098558 16485 containerizer.cpp:89] Using isolation: posix/cpu,posix/mem
I1015 13:44:40.100411 16485 main.cpp:149] Starting Mesos slave
I1015 13:44:40.101066 16485 slave.cpp:167] Slave started on 1)@127.0.1.1:5051
I1015 13:44:40.101238 16485 slave.cpp:278] Slave resources: cpus(*):4; mem(*):6649; disk(*):109050; ports(*):[31000-32000]
I1015 13:44:40.101335 16485 slave.cpp:306] Slave hostname: riri-desktop
I1015 13:44:40.101346 16485 slave.cpp:307] Slave checkpoint: true
I1015 13:44:40.102597 16489 state.cpp:33] Recovering state from '/tmp/mesos/meta'
I1015 13:44:40.102684 16489 state.cpp:62] Failed to find the latest slave from '/tmp/mesos/meta'
I1015 13:44:40.102777 16493 status_update_manager.cpp:193] Recovering status update manager
I1015 13:44:40.102821 16493 containerizer.cpp:252] Recovering containerizer
I1015 13:44:40.102982 16491 slave.cpp:3198] Finished recovery
I1015 13:44:40.103219 16488 slave.cpp:589] New master detected at master@127.0.1.1:5050
I1015 13:44:40.103313 16488 slave.cpp:625] No credentials provided. Attempting to register without authentication
I1015 13:44:40.103317 16491 status_update_manager.cpp:167] New master detected at master@127.0.1.1:5050
I1015 13:44:40.103333 16488 slave.cpp:636] Detecting new master

I1015 13:45:40.109150 16487 slave.cpp:3053] Current usage 27.72%. Max allowed age: 4.359784084743518days
I1015 13:46:40.119501 16489 slave.cpp:3053] Current usage 27.72%. Max allowed age: 4.359794862235926days

Master (james-pc)

I1015 13:47:55.462615  5670 hierarchical_allocator_process.hpp:563] Recovered cpus(*):4; mem(*):6659; disk(*):107176; ports(*):[31000-32000] (total allocatable: cpus(*):4; mem(*):6659; disk(*):107176; ports(*):[31000-32000]) on slave 20141015-130401-16842879-5050-3432-0 from framework 20141015-134423-16842879-5050-5654-0000
I1015 13:47:58.048534  5671 http.cpp:466] HTTP request for '/master/state.json'
I1015 13:48:01.461993  5667 master.cpp:3559] Sending 1 offers to framework 20141015-134423-16842879-5050-5654-0000
I1015 13:48:01.464038  5670 master.cpp:2169] Processing reply for offers: [ 20141015-134423-16842879-5050-5654-36 ] on slave 20141015-130401-16842879-5050-3432-0 at slave(1)@127.0.1.1:5051 (james-pc.syd.local) for framework 20141015-134423-16842879-5050-5654-0000
I1015 13:48:01.464246  5670 hierarchical_allocator_process.hpp:563] Recovered cpus(*):4; mem(*):6659; disk(*):107176; ports(*):[31000-32000] (total allocatable: cpus(*):4; mem(*):6659; disk(*):107176; ports(*):[31000-32000]) on slave 20141015-130401-16842879-5050-3432-0 from framework 20141015-134423-16842879-5050-5654-0000
I1015 13:48:06.464457  5669 master.cpp:3559] Sending 1 offers to framework 20141015-134423-16842879-5050-5654-0000
I1015 13:48:06.466624  5667 master.cpp:2169] Processing reply for offers: [ 20141015-134423-16842879-5050-5654-37 ] on slave 20141015-130401-16842879-5050-3432-0 at slave(1)@127.0.1.1:5051 (james-pc.syd.local) for framework 20141015-134423-16842879-5050-5654-0000
I1015 13:48:06.466841  5671 hierarchical_allocator_process.hpp:563] Recovered cpus(*):4; mem(*):6659; disk(*):107176; ports(*):[31000-32000] (total allocatable: cpus(*):4; mem(*):6659; disk(*):107176; ports(*):[31000-32000]) on slave 20141015-130401-16842879-5050-3432-0 from framework 20141015-134423-16842879-5050-5654-0000
I1015 13:48:08.064483  5673 http.cpp:466] HTTP request for '/master/state.json'
I1015 13:48:12.465992  5674 master.cpp:3559] Sending 1 offers to framework 20141015-134423-16842879-5050-5654-0000
I1015 13:48:12.468195  5670 master.cpp:2169] Processing reply for offers: [ 20141015-134423-16842879-5050-5654-38 ] on slave 20141015-130401-16842879-5050-3432-0 at slave(1)@127.0.1.1:5051 (james-pc.syd.local) for framework 20141015-134423-16842879-5050-5654-0000
I1015 13:48:12.468408  5670 hierarchical_allocator_process.hpp:563] Recovered cpus(*):4; mem(*):6659; disk(*):107176; ports(*):[31000-32000] (total allocatable: cpus(*):4; mem(*):6659; disk(*):107176; ports(*):[31000-32000]) on slave 20141015-130401-16842879-5050-3432-0 from framework 20141015-134423-16842879-5050-5654-0000

james@james-pc:/var/log/mesos$ cat mesos-slave.james-pc.invalid-user.log.INFO.20141015-134946.6069

Log file created at: 2014/10/15 13:49:46
Running on machine: james-pc
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1015 13:49:46.323657  6069 logging.cpp:142] INFO level logging started!
I1015 13:49:46.323825  6069 main.cpp:126] Build: 2014-09-23 05:36:09 by root
I1015 13:49:46.323837  6069 main.cpp:128] Version: 0.20.1
I1015 13:49:46.323842  6069 main.cpp:131] Git tag: 0.20.1
I1015 13:49:46.323846  6069 main.cpp:135] Git SHA: fe0a39112f3304283f970f1b08b322b1e970829d
I1015 13:49:46.323860  6069 containerizer.cpp:89] Using isolation: posix/cpu,posix/mem
I1015 13:49:46.324012  6069 main.cpp:149] Starting Mesos slave
I1015 13:49:46.324472  6084 slave.cpp:167] Slave started on 1)@127.0.1.1:5051
I1015 13:49:46.324604  6084 slave.cpp:278] Slave resources: cpus(*):4; mem(*):6659; disk(*):107176; ports(*):[31000-32000]
I1015 13:49:46.324697  6084 slave.cpp:306] Slave hostname: james-pc.syd.local
I1015 13:49:46.324709  6084 slave.cpp:307] Slave checkpoint: true
I1015 13:49:46.326089  6079 state.cpp:33] Recovering state from '/tmp/mesos/meta'
I1015 13:49:46.326375  6084 status_update_manager.cpp:193] Recovering status update manager
I1015 13:49:46.326452  6079 containerizer.cpp:252] Recovering containerizer
I1015 13:49:46.326608  6083 slave.cpp:3198] Finished recovery
I1015 13:49:46.327335  6084 group.cpp:313] Group process (group(1)@127.0.1.1:5051) connected to ZooKeeper
I1015 13:49:46.327352  6084 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I1015 13:49:46.327360  6084 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I1015 13:49:46.328199  6085 detector.cpp:138] Detected a new leader: (id='5')
I1015 13:49:46.328272  6085 group.cpp:658] Trying to get '/mesos/info_0000000005' in ZooKeeper
I1015 13:49:46.328738  6084 detector.cpp:426] A new leading master (UPID=master@127.0.1.1:5050) is detected
I1015 13:49:46.328806  6085 slave.cpp:589] New master detected at master@127.0.1.1:5050
I1015 13:49:46.328881  6085 slave.cpp:625] No credentials provided. Attempting to register without authentication
I1015 13:49:46.328886  6078 status_update_manager.cpp:167] New master detected at master@127.0.1.1:5050
I1015 13:49:46.328897  6085 slave.cpp:636] Detecting new master
I1015 13:49:46.662595  6085 slave.cpp:816] Re-registered with master master@127.0.1.1:5050
W1015 13:50:19.134799  6078 slave.cpp:791] Already registered with master master@127.0.1.1:5050
I1015 13:50:46.338639  6082 slave.cpp:3053] Current usage 59.91%. Max allowed age: 2.106364690479491days
W1015 13:51:07.704756  6082 slave.cpp:791] Already registered with master master@127.0.1.1:5050
W1015 13:51:15.611064  6078 slave.cpp:791] Already registered with master master@127.0.1.1:5050
W1015 13:51:18.703999  6082 slave.cpp:791] Already registered with master master@127.0.1.1:5050
W1015 13:51:21.911741  6079 slave.cpp:791] Already registered with master master@127.0.1.1:5050

Solution

  • You're using local (loopback) IP addresses:

    I1015 13:49:46.324472  6084 slave.cpp:167] Slave started on 1)@127.0.1.1:5051
    

    127.0.1.1 is only reachable from the machine itself, so the two machines may not be able to talk to each other properly. Try setting those addresses to each machine's real network IP.
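A quick way to check for this on each machine is to see what its own hostname resolves to. The sketch below is only an illustration and assumes that the daemon, when it is not given an explicit IP, falls back to resolving the local hostname; the function names are made up for this example.

```python
import ipaddress
import socket


def is_loopback(address: str) -> bool:
    """Return True if the address is in a loopback range such as 127.0.0.0/8."""
    return ipaddress.ip_address(address).is_loopback


def resolved_own_address() -> str:
    """Resolve this machine's hostname to an IPv4 address."""
    return socket.gethostbyname(socket.gethostname())


if __name__ == "__main__":
    try:
        addr = resolved_own_address()
    except OSError as exc:
        print(f"could not resolve local hostname: {exc}")
    else:
        if is_loopback(addr):
            print(f"{addr} is a loopback address; other machines cannot reach it")
        else:
            print(f"{addr} looks routable")
```

If this reports a loopback address (127.0.1.1 is common when /etc/hosts maps the hostname to it), that would match the "Slave started on 1)@127.0.1.1:5051" line in the logs.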

    A couple of places to look (I use the Mesosphere Google deploy):

    Slave (some need the master's IP, some the slave's IP):

    /etc/mesos-slave/hostname
    /etc/mesos-slave/attributes/host
    /etc/mesos/zk
    /etc/hadoop/conf/core-site.xml
    /etc/hadoop/conf/mapred-site.xml
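As a rough sanity check for the /etc/mesos/zk entry, something like the following could flag a ZooKeeper URL that still points at the local machine. It assumes the file holds a single URL of the form zk://host1:port1,host2:port2/path; the helper names are hypothetical, and IPv6 literals are not handled.

```python
import ipaddress
from pathlib import Path
from typing import List


def zk_hosts(zk_url: str) -> List[str]:
    """Extract host names from a URL such as zk://host1:2181,host2:2181/mesos."""
    prefix = "zk://"
    if not zk_url.startswith(prefix):
        raise ValueError(f"not a zk:// URL: {zk_url!r}")
    authority = zk_url[len(prefix):].split("/", 1)[0]
    # Drop optional "user:password@" credentials.
    authority = authority.rsplit("@", 1)[-1]
    # Strip the ":port" suffix from each comma-separated entry.
    return [entry.rsplit(":", 1)[0] for entry in authority.split(",")]


def looks_local(host: str) -> bool:
    """True for 'localhost' or a literal loopback IP; other names are not resolved."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False


if __name__ == "__main__":
    zk_file = Path("/etc/mesos/zk")
    if zk_file.exists():
        for host in zk_hosts(zk_file.read_text().strip()):
            note = "points at this machine only" if looks_local(host) else "ok"
            print(f"{host}: {note}")
    else:
        print(f"{zk_file} not found")
```

On the slave-only machine, every host listed there should be an address of the master machine that is reachable over the network.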
    

    Hope it helps!