Hello, I'm very new to cloud computing, so I apologize if this is a stupid question. I need help understanding whether what I'm doing actually computes on the cluster or only on the master node (which would be useless).
WHAT I CAN DO: I can set up a cluster of a certain number of nodes, with Spark installed on all of them, using the AWS console, and I can connect to the master node via SSH. What is required then to run my jar with Spark code on the cluster?
WHAT I'D DO: I'd call spark-submit to run my code:
spark-submit --class cc.Main /home/ubuntu/MySparkCode.jar 3 [arguments]
MY DOUBTS:
Do I need to specify the master with --master and the "spark://" reference of the master? Where would I find that reference? Should I run the sbin/start-master.sh script to start a standalone cluster manager, or is it already set up? If I run the command above as is, I imagine the code would run only locally on the master, right?
Can I keep my input files only on the master node? Suppose I want to count the words of a huge text file: can I keep it only on the master's disk, or do I need distributed storage like HDFS to maintain parallelism? I don't understand this part; I'd keep it on the master node's disk if it fits.
Thanks in advance for any reply.
UPDATE1: I tried to run the SparkPi example on the cluster, but I can't see the result.
$ sudo spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster /usr/lib/spark/examples/jars/spark-examples.jar 10
I would expect to see a line printing Pi is roughly 3.14...
Instead I get:
17/04/15 13:16:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/15 13:16:03 INFO RMProxy: Connecting to ResourceManager at ip-172-31-37-222.us-west-2.compute.internal/172.31.37.222:8032
17/04/15 13:16:03 INFO Client: Requesting a new application from cluster with 2 NodeManagers
17/04/15 13:16:03 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (5120 MB per container)
17/04/15 13:16:03 INFO Client: Will allocate AM container, with 5120 MB memory including 465 MB overhead
17/04/15 13:16:03 INFO Client: Setting up container launch context for our AM
17/04/15 13:16:03 INFO Client: Setting up the launch environment for our AM container
17/04/15 13:16:03 INFO Client: Preparing resources for our AM container
17/04/15 13:16:06 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/04/15 13:16:10 INFO Client: Uploading resource file:/mnt/tmp/spark-aa757ca0-4ff7-460c-8bee-27bc8c8dada9/__spark_libs__5838015067814081789.zip -> hdfs://ip-172-31-37-222.us-west-2.compute.internal:8020/user/root/.sparkStaging/application_1492261407069_0007/__spark_libs__5838015067814081789.zip
17/04/15 13:16:12 INFO Client: Uploading resource file:/usr/lib/spark/examples/jars/spark-examples.jar -> hdfs://ip-172-31-37-222.us-west-2.compute.internal:8020/user/root/.sparkStaging/application_1492261407069_0007/spark-examples.jar
17/04/15 13:16:12 INFO Client: Uploading resource file:/mnt/tmp/spark-aa757ca0-4ff7-460c-8bee-27bc8c8dada9/__spark_conf__1370316719712336297.zip -> hdfs://ip-172-31-37-222.us-west-2.compute.internal:8020/user/root/.sparkStaging/application_1492261407069_0007/__spark_conf__.zip
17/04/15 13:16:13 INFO SecurityManager: Changing view acls to: root
17/04/15 13:16:13 INFO SecurityManager: Changing modify acls to: root
17/04/15 13:16:13 INFO SecurityManager: Changing view acls groups to:
17/04/15 13:16:13 INFO SecurityManager: Changing modify acls groups to:
17/04/15 13:16:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
17/04/15 13:16:13 INFO Client: Submitting application application_1492261407069_0007 to ResourceManager
17/04/15 13:16:13 INFO YarnClientImpl: Submitted application application_1492261407069_0007
17/04/15 13:16:14 INFO Client: Application report for application_1492261407069_0007 (state: ACCEPTED)
17/04/15 13:16:14 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1492262173096
final status: UNDEFINED
tracking URL: http://ip-172-31-37-222.us-west-2.compute.internal:20888/proxy/application_1492261407069_0007/
user: root
17/04/15 13:16:15 INFO Client: Application report for application_1492261407069_0007 (state: ACCEPTED)
17/04/15 13:16:24 INFO Client: Application report for application_1492261407069_0007 (state: ACCEPTED)
17/04/15 13:16:25 INFO Client: Application report for application_1492261407069_0007 (state: RUNNING)
17/04/15 13:16:25 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 172.31.33.215
ApplicationMaster RPC port: 0
queue: default
start time: 1492262173096
final status: UNDEFINED
tracking URL: http://ip-172-31-37-222.us-west-2.compute.internal:20888/proxy/application_1492261407069_0007/
user: root
17/04/15 13:16:26 INFO Client: Application report for application_1492261407069_0007 (state: RUNNING)
17/04/15 13:16:55 INFO Client: Application report for application_1492261407069_0007 (state: RUNNING)
17/04/15 13:16:56 INFO Client: Application report for application_1492261407069_0007 (state: FINISHED)
17/04/15 13:16:56 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 172.31.33.215
ApplicationMaster RPC port: 0
queue: default
start time: 1492262173096
final status: SUCCEEDED
tracking URL: http://ip-172-31-37-222.us-west-2.compute.internal:20888/proxy/application_1492261407069_0007/
user: root
17/04/15 13:16:56 INFO ShutdownHookManager: Shutdown hook called
17/04/15 13:16:56 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-aa757ca0-4ff7-460c-8bee-27bc8c8dada9
I am assuming that you want to run Spark on YARN.
You can just pass --master yarn --deploy-mode cluster; with those options the Spark driver runs inside an application master process that is managed by YARN on the cluster:
spark-submit --master yarn --deploy-mode cluster \
--class cc.Main /home/ubuntu/MySparkCode.jar 3 [arguments]
See the Spark documentation on submitting applications for the other master/deploy-mode options.
When you run a job with --deploy-mode cluster, you don't see the output (anything you print) on the machine where you submitted it.
Reason: in cluster mode the driver runs inside the application master on one of the nodes in the cluster, so its output is emitted on that machine.
To check the output, look at the application logs with the following command:
yarn logs -applicationId application_id
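For example, for the SparkPi run above (application ID taken from the log output, and assuming YARN log aggregation is enabled on the cluster), you could filter the aggregated logs for the result line:

yarn logs -applicationId application_1492261407069_0007 | grep "Pi is roughly"

Since the job in the log was submitted as root, run this as the same user (or pass the owner explicitly) so YARN can find the aggregated logs for that application.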
You can keep your input files anywhere (on the master node's local disk or in HDFS); note that if you use a plain local path, the file must be readable at that same path on every node that runs executors, which is why HDFS is usually preferred for large files.
Parallelism depends entirely on the number of partitions of the RDD/DataFrame created when you load the data. The number of partitions depends on the data size, though you can control it by passing a parameter when you load the data.
If you are loading data from the master's local disk:

val rdd = sc.textFile("/home/ubuntu/input.txt", [number of partitions])

The RDD will be created with the number of partitions you passed. If you do not pass a number of partitions, Spark will use spark.default.parallelism as configured in the Spark conf.
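As a minimal sketch (assuming a text file at /home/ubuntu/input.txt; the partition count 8 is just an example), you can verify in spark-shell how many partitions were actually created:

// in spark-shell, where sc is the SparkContext
val rdd = sc.textFile("/home/ubuntu/input.txt", 8)   // request 8 partitions
println(rdd.getNumPartitions)                        // how many partitions Spark actually created
println(rdd.flatMap(_.split("\\s+")).count())        // e.g. a rough word count, executed in parallel across partitions

Note that textFile's second argument is a minimum number of partitions, so Spark may create more for a large file.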
If you are loading data from HDFS:

val rdd = sc.textFile("hdfs://namenode:8020/data/input.txt")

The RDD will be created with a number of partitions equal to the number of blocks the file occupies in HDFS.
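If you first want to copy the file from the master's disk into HDFS, a minimal sketch would be (the /data directory and file names are just example paths):

hdfs dfs -mkdir -p /data
hdfs dfs -put /home/ubuntu/input.txt /data/input.txt
hdfs dfs -ls /data    # verify the file landed and check its size

After that, the sc.textFile call above gives you roughly one partition per HDFS block (typically 128 MB), so a truly huge file is split across the cluster automatically.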
Hope my answer helps you.