I'm trying to simulate a multi-node Mesos cluster using Docker and Zookeeper and trying to run a simple (py)Spark job on top of it. These Docker containers and the pyspark script are all run on the same machine. However, when I execute my Spark script, it hangs at:
No credentials provided. Attempting to register without authentication
The Mesos slave constantly outputs:
I0929 14:59:32.925915 62 slave.cpp:1959] Asked to shut down framework 20150929-143802-1224741292-5050-33-0060 by [email protected]:5050
W0929 14:59:32.926035 62 slave.cpp:1974] Cannot shut down unknown framework 20150929-143802-1224741292-5050-33-0060
And the Mesos master constantly outputs:
I0929 14:38:15.169683 39 master.cpp:2094] Received SUBSCRIBE call for framework 'test' at [email protected]:46693
I0929 14:38:15.169845 39 master.cpp:2164] Subscribing framework test with checkpointing disabled and capabilities [ ]
E0929 14:38:15.170361 42 socket.hpp:174] Shutdown failed on fd=15: Transport endpoint is not connected [107]
I0929 14:38:15.170409 36 hierarchical.hpp:391] Added framework 20150929-143802-1224741292-5050-33-0000
I0929 14:38:15.170534 39 master.cpp:1051] Framework 20150929-143802-1224741292-5050-33-0000 (test) at [email protected]:46693 disconnected
I0929 14:38:15.170549 39 master.cpp:2370] Disconnecting framework 20150929-143802-1224741292-5050-33-0000 (test) at [email protected]:46693
I0929 14:38:15.170555 39 master.cpp:2394] Deactivating framework 20150929-143802-1224741292-5050-33-0000 (test) at [email protected]:46693
E0929 14:38:15.170560 42 socket.hpp:174] Shutdown failed on fd=16: Transport endpoint is not connected [107]
I0929 14:38:15.170593 39 master.cpp:1075] Giving framework 20150929-143802-1224741292-5050-33-0000 (test) at [email protected]:46693 0ns to failover
W0929 14:38:15.170835 41 master.cpp:4482] Master returning resources offered to framework 20150929-143802-1224741292-5050-33-0000 because the framework has terminated or is inactive
I0929 14:38:15.170855 36 hierarchical.hpp:474] Deactivated framework 20150929-143802-1224741292-5050-33-0000
I0929 14:38:15.170990 37 hierarchical.hpp:814] Recovered cpus(*):8; mem(*):31092; disk(*):443036; ports(*):[31000-32000] (total: cpus(*):8; mem(*):31092; disk(*):443036; ports(*):[31000-32000
], allocated: ) on slave 20150929-051336-1224741292-5050-19-S0 from framework 20150929-143802-1224741292-5050-33-0000
I0929 14:38:15.171820 41 master.cpp:4469] Framework failover timeout, removing framework 20150929-143802-1224741292-5050-33-0000 (test) at [email protected]
.1.1:46693
I0929 14:38:15.171835 41 master.cpp:5112] Removing framework 20150929-143802-1224741292-5050-33-0000 (test) at [email protected]:46693
I0929 14:38:15.172130 41 hierarchical.hpp:428] Removed framework 20150929-143802-1224741292-5050-33-0000
The Mesos master Docker image is built with the following Dockerfile
FROM ubuntu:14.04
ENV MESOS_V 0.24.0
# update
RUN apt-get update
RUN apt-get upgrade -y
# dependencies
RUN apt-get install -y wget openjdk-7-jdk build-essential python-dev python-boto libcurl4-nss-dev libsasl2-dev maven libapr1-dev libsvn-dev
# mesos
RUN wget http://www.apache.org/dist/mesos/${MESOS_V}/mesos-${MESOS_V}.tar.gz
RUN tar -zxf mesos-*.tar.gz
RUN rm mesos-*.tar.gz
RUN mv mesos-* mesos
WORKDIR mesos
RUN mkdir build
RUN ./configure
RUN make
RUN make install
RUN ldconfig
EXPOSE 5050
ENTRYPOINT ["/bin/bash"]
And I manually execute the mesos-master
command:
LIBPROCESS_IP=${MASTER_IP} mesos-master --registry=in_memory --ip=${MASTER_IP} --zk=zk://172.17.0.75:2181/mesos --advertise_ip=${MASTER_IP}
The Mesos slave Docker image is built using the same Dockerfile except port 5051 is exposed instead. Then I run the following command in its container:
LIBPROCESS_IP=172.17.0.72 mesos-slave --master=zk://172.17.0.75:2181/mesos
The pyspark script is:
import os
import pyspark
src = 'file:///{}/README.md'.format(os.environ['SPARK_HOME'])
leader_ip = '172.17.0.75'
conf = pyspark.SparkConf()
conf.setMaster('mesos://zk://{}:2181/mesos'.format(leader_ip))
conf.set('spark.executor.uri', 'http://d3kbcqa49mib13.cloudfront.net/spark-1.5.0-bin-hadoop2.6.tgz')
conf.setAppName('my_test_app')
sc = pyspark.SparkContext(conf=conf)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(' '))
word_count = (words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x+y))
print(word_count.collect())
Here is the complete output of the pyspark script:
15/09/29 11:07:59 INFO SparkContext: Running Spark version 1.5.0
15/09/29 11:07:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/29 11:07:59 WARN Utils: Your hostname, hubble resolves to a loopback address: 127.0.1.1; using 192.168.1.2 instead (on interface em1)
15/09/29 11:07:59 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/09/29 11:07:59 INFO SecurityManager: Changing view acls to: ftseng
15/09/29 11:07:59 INFO SecurityManager: Changing modify acls to: ftseng
15/09/29 11:07:59 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ftseng); users with modify permissions: Set(ftseng)
15/09/29 11:08:00 INFO Slf4jLogger: Slf4jLogger started
15/09/29 11:08:00 INFO Remoting: Starting remoting
15/09/29 11:08:00 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:38860]
15/09/29 11:08:00 INFO Utils: Successfully started service 'sparkDriver' on port 38860.
15/09/29 11:08:00 INFO SparkEnv: Registering MapOutputTracker
15/09/29 11:08:00 INFO SparkEnv: Registering BlockManagerMaster
15/09/29 11:08:00 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-28695bd2-fc83-45f4-b0a0-eefcfb80a3b5
15/09/29 11:08:00 INFO MemoryStore: MemoryStore started with capacity 530.3 MB
15/09/29 11:08:00 INFO HttpFileServer: HTTP File server directory is /tmp/spark-89444c7a-725a-4454-87db-8873f4134580/httpd-341c3da9-16d5-43a4-93ee-0e8b47389fdb
15/09/29 11:08:00 INFO HttpServer: Starting HTTP Server
15/09/29 11:08:00 INFO Utils: Successfully started service 'HTTP file server' on port 51405.
15/09/29 11:08:00 INFO SparkEnv: Registering OutputCommitCoordinator
15/09/29 11:08:00 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/09/29 11:08:00 INFO SparkUI: Started SparkUI at http://192.168.1.2:4040
15/09/29 11:08:00 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO@log_env@716: Client environment:host.name=hubble
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO@log_env@724: Client environment:os.arch=3.19.0-25-generic
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO@log_env@725: Client environment:os.version=#26-Ubuntu SMP Fri Jul 24 21:17:31 UTC 2015
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO@log_env@733: Client environment:user.name=ftseng
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO@log_env@741: Client environment:user.home=/home/ftseng
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO@log_env@753: Client environment:user.dir=/home/ftseng
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=172.17.0.75:2181 sessionTimeout=10000 watcher=0x7fc0962b7176 sessionId=0 sessionPasswd=<null> context=0x7fc078001860 flags=0
I0929 11:08:00.651923 32328 sched.cpp:164] Version: 0.24.0
2015-09-29 11:08:00,652:32221(0x7fc06bfff700):ZOO_INFO@check_events@1703: initiated connection to server [172.17.0.75:2181]
2015-09-29 11:08:00,657:32221(0x7fc06bfff700):ZOO_INFO@check_events@1750: session establishment complete on server [172.17.0.75:2181], sessionId=0x150177fcfc40014, negotiated timeout=10000
I0929 11:08:00.658051 32322 group.cpp:331] Group process (group(1)@127.0.1.1:48692) connected to ZooKeeper
I0929 11:08:00.658083 32322 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0929 11:08:00.658100 32322 group.cpp:403] Trying to create path '/mesos' in ZooKeeper
I0929 11:08:00.659600 32326 detector.cpp:156] Detected a new leader: (id='2')
I0929 11:08:00.659904 32325 group.cpp:674] Trying to get '/mesos/json.info_0000000002' in ZooKeeper
I0929 11:08:00.661052 32326 detector.cpp:481] A new leading master ([email protected]:5050) is detected
I0929 11:08:00.661201 32320 sched.cpp:262] New master detected at [email protected]:5050
I0929 11:08:00.661798 32320 sched.cpp:272] No credentials provided. Attempting to register without authentication
After a lot more experimentation, it looks like it was an issue with the IP address of the host machine (using its local network address, 192.168.xx.xx) when it should have been using its Docker host IP (172.17.xx.xx).
I managed to get things running with:
LIBPROCESS_IP=172.17.xx.xx python test_spark.py
I'm now hitting a different error, but it seems unrelated, so I think this command solves my problem.
I'm not familiar enough with Mesos/Spark yet to understand why this fixes things, so if someone wants to add an explanation, that would be very helpful.