Tags: apache-spark, mesos, mesosphere

spark-shell declining offers from mesos master


I have been trying to learn Spark on Mesos, but the spark-shell just keeps declining the offers from the master. Here is my setup:

All the components are in the same subnet

  • 1 mesos master on an EC2 instance (t2.micro)

    command: mesos-master --work_dir=/tmp/abc --hostname=<public IP>

  • 2 mesos agents (each with 4 cores, 16 GB RAM, and 30 GB of disk)

    command: mesos-slave --master="<private IP of master>:5050" --hostname="<private IP of slave>" --work_dir=/tmp/abc

  • 1 spark-shell (client) on an EC2 instance (t2.micro). I set the following environment variables on this instance before launching the spark-shell:

    export MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so
    export SPARK_EXECUTOR_URI=local://home/ubuntu/spark-2.1.1-bin-hadoop2.7.tgz
    

    and then I launch the spark-shell as follows:

    ./bin/spark-shell --master mesos://172.31.1.93:5050 
    

    (172.31.1.93 is the private IP of the master)

    I made sure that spark-2.1.1-bin-hadoop2.7.tgz was placed in /home/ubuntu on both agents before starting the spark-shell.

Once the spark-shell is up, I run the simplest program possible:

val f = sc.textFile("/tmp/ok.txt")
f.count()

... and I keep getting the following warnings in the spark-shell:

17/05/21 15:13:34 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
17/05/21 15:13:49 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
17/05/21 15:14:04 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Master-side logs (these appear even before I do anything inside the spark-shell, and they keep coming after I have run the above code):

I0521 15:14:12.949108 10166 master.cpp:6992] Sending 2 offers to framework 64c1ef67-9e4f-4236-bb86-80d7aaab540f-0000 (Spark shell) at scheduler-7a375e65-7a0d-4267-befa-e69937404d5f@172.31.1.203:45596
I0521 15:14:12.955731 10164 master.cpp:4731] Processing DECLINE call for offers: [ 64c1ef67-9e4f-4236-bb86-80d7aaab540f-O34 ] for framework 64c1ef67-9e4f-4236-bb86-80d7aaab540f-0000 (Spark shell) at scheduler-7a375e65-7a0d-4267-befa-e69937404d5f@172.31.1.203:45596
I0521 15:14:12.956130 10167 master.cpp:4731] Processing DECLINE call for offers: [ 64c1ef67-9e4f-4236-bb86-80d7aaab540f-O35 ] for framework 64c1ef67-9e4f-4236-bb86-80d7aaab540f-0000 (Spark shell) at scheduler-7a375e65-7a0d-4267-befa-e69937404d5f@172.31.1.203:45596

I am using Mesos 1.2.0 and Spark 2.1.1 on Ubuntu 16.04. I verified the offers themselves by writing a small Node.js-based HTTP client, and they look fine. What could possibly be going wrong here?
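
For reference, an equivalent check can be done without a custom client. This is a sketch assuming the master's standard HTTP endpoint (/master/state in Mesos 1.x), which returns the registered agents, frameworks, and their resources as JSON:

    # Inspect the master's view of agents and frameworks
    curl -s http://172.31.1.93:5050/master/state | python3 -m json.tool

If both agents show up there with their full CPU and memory, the offers themselves are not the problem.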


Solution

  • OK, there were two problems here.

    1. The SPARK_EXECUTOR_URI was a local:// URI, so I changed it to http (see the sketch after this list). local, I guess, is meant for Hadoop (correct me here if I'm wrong).

    2. After changing the URI to http, the Netty block manager service (which runs as part of the Spark executor, launched as a coarse-mode task by the mesos-executor, which in turn is started by the Mesos containerizer under the mesos-agent) kept failing while trying to bind to the public IP. The cause: I had passed the public IP as the hostname to the mesos-agent, and binding to the public IP is bound to fail on EC2. I had actually passed the private IP at first, but changed the hostname to the public IP, probably to check the sandbox logs: with the private hostname, the Mesos master UI redirects to the agent's private IP, which prevented me from seeing the stderr logs (I am located outside the EC2 VPC). Note that the question above shows the private IP being passed to the agent, which is correct; the question was originally posted for the first problem only. A minimal sketch of the corrected setup, covering both fixes, follows this list.
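
For anyone hitting the same pair of problems, here is a minimal sketch of the corrected setup. The choice of server (Python's built-in http.server), port 8000, and serving the archive from the master's instance are illustrative assumptions; any HTTP server the agents can reach will do.

    # Fix 1: serve the Spark archive over HTTP instead of local://
    # (assumption: serving from /home/ubuntu on the master, 172.31.1.93)
    cd /home/ubuntu
    python3 -m http.server 8000

    # On the spark-shell instance, point the executor URI at that server
    export SPARK_EXECUTOR_URI=http://172.31.1.93:8000/spark-2.1.1-bin-hadoop2.7.tgz
    # (equivalently: pass --conf spark.executor.uri=<url> to spark-shell)

    # Fix 2: start each agent with its *private* IP as the hostname;
    # an EC2 instance cannot bind to its public IP
    mesos-slave --master="<private IP of master>:5050" \
                --hostname="<private IP of slave>" \
                --work_dir=/tmp/abc

To still read the sandbox stderr from outside the VPC, one option is an SSH tunnel to the agent (for example, ssh -L to its port, 5051 by default) rather than giving the agent a public hostname.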