Search code examples
apache-sparkspark-submitapache-spark-standalone

Spark standalone connection driver to worker


I'm trying to host locally a spark standalone cluster. I have two heterogeneous machines connected on a LAN. Each piece of the architecture listed below is running on docker. I have the following configuration

  • master on machine 1 (port 7077 exposed)
  • worker on machine 1
  • driver on machine 2

I use a test application that opens a file and counts its lines. The application works when the file replicated on all workers and I use SparkContext.readText()

But when when the file is only present on worker while I'm using SparkContext.parallelize() to access it on workers, I have the following display :

INFO StandaloneSchedulerBackend: Granted executor ID app-20180116210619-0007/4 on hostPort 172.17.0.3:6598 with 4 cores, 1024.0 MB RAM
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180116210619-0007/4 is now RUNNING
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180116210619-0007/4 is now EXITED (Command exited with code 1)
INFO StandaloneSchedulerBackend: Executor app-20180116210619-0007/4 removed: Command exited with code 1
INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20180116210619-0007/5 on worker-20180116205132-172.17.0.3-6598 (172.17.0.3:6598) with 4 cores```

that goes on and on again without actually computing the app.

This is working when I put the driver on the same pc as the worker. So I guess there is some kind of connection to permit between so two across the network. Are you aware of a way to do that (which ports to open, which adress to add in /etc/hosts ...)


Solution

  • TL;DR Make sure that spark.driver.host:spark.driver.port can be accessed from each node in the cluster.

    In general you have ensure that all nodes (both executors and master) can reach the driver.

    • In the cluster mode, where driver runs on one of the executors this is satisfied by default, as long as no ports are closed for the connections (see below).
    • In client mode machine, on which driver has been started, has to be accessible from the cluster. It means that spark.driver.host has to resolve to a publicly reachable address.

    In both cases you have to keep in mind, that by default driver runs on a random port. It is possible to use a fixed one by setting spark.driver.port. Obviously this doesn't work that well, if you want to submit multiple applications at the same time.

    Furthermore:

    when when the file is only present on worker

    won't work. All inputs have to be accessible from driver, as well as, from each executor node.