Tags: apache-spark, hadoop-yarn

Connecting to remote master with identity key (other auth methods?)


I'm trying to run spark-submit against a remote master; the twist is that the remote master requires an identity file (a *.pem key) for access.

My command:

spark-submit --master spark://<ip_remote_master>:7077 --conf spark.sql.files.ignoreCorruptFiles=true --conf spark.sql.files.ignoreMissingFiles=true --driver-memory 1g --executor-memory 2g run_script.py

Error I'm getting:

21/12/15 13:01:19 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://<ip_remote_master>:7077...
21/12/15 13:01:20 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master <ip_remote_master>:7077
org.apache.spark.SparkException: Exception thrown in awaitResult:
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
        at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anon$1.run(StandaloneAppClient.scala:107)

I tried adding <ip_remote_master> to ~/.ssh/config with the relevant *.pem file, but I suspect that was a dead end, because Spark's connection to the master is not an SSH session.
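
For reference, the entry was roughly this (the key path is a placeholder):

Host <ip_remote_master>
    IdentityFile ~/.ssh/<my_key>.pem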

How can I make this work while still keeping my remote host's authentication in place?

One detail I think is irrelevant, since I'm looking for a cloud-agnostic solution: local = AWS EC2, remote = AWS EMR (I can SSH from one to the other).


Solution

  • It sounds like you are baking in your own security when you should really be looking at Spark's built-in security. The short answer: Standalone (master/worker) mode is the most manual labour to secure. In general, from the Spark security docs:

    Configuring Ports for Network Security: "Generally speaking, a Spark cluster and its services are not deployed on the public internet. They are generally private services, and should only be accessible within the network of the organization that deploys Spark. Access to the hosts and ports used by Spark services should be limited to origin hosts that need to access the services."

    This is basically telling you to rely on network partitioning/network security rather than an "identity key". You can encrypt what is communicated and restrict who can talk to whom, but the ports must be open between machines for them to do work (a minimal sketch of the relevant settings follows below). If you genuinely need strong security, I'd set up an EMR cluster with YARN and Kerberize the cluster. Then you are very secure.
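
    As a minimal sketch of the "encrypt and restrict" route, assuming a standalone cluster where you control spark-defaults.conf on the master, every worker, and the submitting client (the secret value is a placeholder you would generate yourself):

        # spark-defaults.conf -- same settings on master, workers, and client
        # Enable shared-secret authentication between Spark processes
        spark.authenticate              true
        spark.authenticate.secret       <shared-secret>
        # Enable AES-based encryption for RPC traffic
        spark.network.crypto.enabled    true

    Note this authenticates Spark processes to each other; it does not firewall port 7077, which is why the docs still push network-level restrictions. For the Kerberized EMR + YARN route, the submission stops targeting a standalone master at all. A hypothetical submit against YARN, using spark-submit's --principal/--keytab options (the principal and keytab path are placeholders), could look like:

        spark-submit \
          --master yarn \
          --deploy-mode cluster \
          --principal analyst@EXAMPLE.COM \
          --keytab /etc/security/keytabs/analyst.keytab \
          --conf spark.sql.files.ignoreCorruptFiles=true \
          --conf spark.sql.files.ignoreMissingFiles=true \
          --driver-memory 1g --executor-memory 2g \
          run_script.py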