I'm trying to host locally a spark standalone cluster. I have two heterogeneous machines connected on a LAN. Each piece of the architecture listed below is running on docker. I have the following configuration
I use a test application that opens a file and counts its lines.
The application works when the file replicated on all workers and I use SparkContext.readText()
But when when the file is only present on worker while I'm using SparkContext.parallelize()
to access it on workers, I have the following display :
INFO StandaloneSchedulerBackend: Granted executor ID app-20180116210619-0007/4 on hostPort 172.17.0.3:6598 with 4 cores, 1024.0 MB RAM
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180116210619-0007/4 is now RUNNING
INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20180116210619-0007/4 is now EXITED (Command exited with code 1)
INFO StandaloneSchedulerBackend: Executor app-20180116210619-0007/4 removed: Command exited with code 1
INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20180116210619-0007/5 on worker-20180116205132-172.17.0.3-6598 (172.17.0.3:6598) with 4 cores```
that goes on and on again without actually computing the app.
This is working when I put the driver on the same pc as the worker. So I guess there is some kind of connection to permit between so two across the network. Are you aware of a way to do that (which ports to open, which adress to add in /etc/hosts ...)
TL;DR Make sure that spark.driver.host:spark.driver.port
can be accessed from each node in the cluster.
In general you have ensure that all nodes (both executors and master) can reach the driver.
spark.driver.host
has to resolve to a publicly reachable address.In both cases you have to keep in mind, that by default driver runs on a random port. It is possible to use a fixed one by setting spark.driver.port
. Obviously this doesn't work that well, if you want to submit multiple applications at the same time.
Furthermore:
when when the file is only present on worker
won't work. All inputs have to be accessible from driver, as well as, from each executor node.