apache-spark hadoop-yarn

Launch Apache Spark application once and wait for data to process


I'm launching an Apache Spark application on YARN (Hadoop). The application works correctly, but the time it spends waiting in the ACCEPTED and RUNNING states is too long. For example, I want to count the words in a small file (~100 words). I start the app with:

/opt/spark/bin/spark-submit --class org.apache.spark.examples.JavaWordCount --deploy-mode cluster --master yarn --driver-memory 2g --executor-memory 2g /opt/spark/examples/jars/spark-examples_2.11-2.0.0.jar hdfs://hadoop-master:9000/input/file.txt

and then I wait:
- ACCEPTED - 11s,
- RUNNING - 25s
plus a few seconds before ACCEPTED and after RUNNING:

16/08/26 15:18:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/08/26 15:18:27 INFO client.RMProxy: Connecting to ResourceManager at hadoop-master/172.29.74.68:8032
16/08/26 15:18:27 INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers
16/08/26 15:18:27 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (4096 MB per container)
16/08/26 15:18:27 INFO yarn.Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
16/08/26 15:18:27 INFO yarn.Client: Setting up container launch context for our AM
16/08/26 15:18:27 INFO yarn.Client: Setting up the launch environment for our AM container
16/08/26 15:18:27 INFO yarn.Client: Preparing resources for our AM container
16/08/26 15:18:27 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/08/26 15:18:32 INFO yarn.Client: Uploading resource file:/tmp/spark-b8aa8874-9747-4c1f-8390-d0abbad019ee/__spark_libs__3386575858123884242.zip -> hdfs://hadoop-master:9000/user/root/.sparkStaging/application_1472201718061_0015/__spark_libs__3386575858123884242.zip
16/08/26 15:18:37 INFO yarn.Client: Uploading resource file:/opt/spark/examples/jars/spark-examples_2.11-2.0.0.jar -> hdfs://hadoop-master:9000/user/root/.sparkStaging/application_1472201718061_0015/spark-examples_2.11-2.0.0.jar
16/08/26 15:18:37 INFO yarn.Client: Uploading resource file:/tmp/spark-b8aa8874-9747-4c1f-8390-d0abbad019ee/__spark_conf__1130150930664135048.zip -> hdfs://hadoop-master:9000/user/root/.sparkStaging/application_1472201718061_0015/__spark_conf__.zip
16/08/26 15:18:37 INFO spark.SecurityManager: Changing view acls to: root
16/08/26 15:18:37 INFO spark.SecurityManager: Changing modify acls to: root
16/08/26 15:18:37 INFO spark.SecurityManager: Changing view acls groups to: 
16/08/26 15:18:37 INFO spark.SecurityManager: Changing modify acls groups to: 
16/08/26 15:18:37 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
16/08/26 15:18:37 INFO yarn.Client: Submitting application application_1472201718061_0015 to ResourceManager
16/08/26 15:18:37 INFO impl.YarnClientImpl: Submitted application application_1472201718061_0015
16/08/26 15:18:38 INFO yarn.Client: Application report for application_1472201718061_0015 (state: ACCEPTED)
16/08/26 15:18:38 INFO yarn.Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1472217517552
     final status: UNDEFINED
     tracking URL: http://hadoop-master:8088/proxy/application_1472201718061_0015/
     user: root
16/08/26 15:18:39 INFO yarn.Client: Application report for application_1472201718061_0015 (state: ACCEPTED)
16/08/26 15:18:40 INFO yarn.Client: Application report for application_1472201718061_0015 (state: ACCEPTED)
16/08/26 15:18:41 INFO yarn.Client: Application report for application_1472201718061_0015 (state: ACCEPTED)
16/08/26 15:18:42 INFO yarn.Client: Application report for application_1472201718061_0015 (state: ACCEPTED)
16/08/26 15:18:43 INFO yarn.Client: Application report for application_1472201718061_0015 (state: ACCEPTED)
16/08/26 15:18:44 INFO yarn.Client: Application report for application_1472201718061_0015 (state: ACCEPTED)
16/08/26 15:18:45 INFO yarn.Client: Application report for application_1472201718061_0015 (state: ACCEPTED)
16/08/26 15:18:46 INFO yarn.Client: Application report for application_1472201718061_0015 (state: ACCEPTED)
16/08/26 15:18:47 INFO yarn.Client: Application report for application_1472201718061_0015 (state: ACCEPTED)
16/08/26 15:18:48 INFO yarn.Client: Application report for application_1472201718061_0015 (state: ACCEPTED)
16/08/26 15:18:49 INFO yarn.Client: Application report for application_1472201718061_0015 (state: ACCEPTED)
16/08/26 15:18:50 INFO yarn.Client: Application report for application_1472201718061_0015 (state: ACCEPTED)
16/08/26 15:18:51 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:18:51 INFO yarn.Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 172.29.77.40
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1472217517552
     final status: UNDEFINED
     tracking URL: http://hadoop-master:8088/proxy/application_1472201718061_0015/
     user: root
16/08/26 15:18:52 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:18:53 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:18:54 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:18:55 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:18:56 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:18:57 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:18:58 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:18:59 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:19:00 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:19:01 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:19:02 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:19:03 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:19:04 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:19:05 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:19:06 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:19:07 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:19:08 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:19:09 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:19:10 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:19:11 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:19:12 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:19:13 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:19:14 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:19:15 INFO yarn.Client: Application report for application_1472201718061_0015 (state: RUNNING)
16/08/26 15:19:16 INFO yarn.Client: Application report for application_1472201718061_0015 (state: FINISHED)
16/08/26 15:19:16 INFO yarn.Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 172.29.77.40
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1472217517552
     final status: SUCCEEDED
     tracking URL: http://hadoop-master:8088/proxy/application_1472201718061_0015/
     user: root
16/08/26 15:19:16 INFO util.ShutdownHookManager: Shutdown hook called
16/08/26 15:19:16 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-b8aa8874-9747-4c1f-8390-d0abbad019ee

That is too long for me. I would like to launch the application once and have it stay up, waiting for data. When I give it a file, it should process the data, return the result, and go back to waiting for the next file. Is this possible with Apache Spark running on YARN?


Solution

  • Yes, it is possible. This is what Spark Streaming is for: it lets a long-running application do batch-like processing in a continuous manner.
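
A minimal sketch of that idea in PySpark, using the DStream API that ships with Spark 2.x (`StreamingContext.textFileStream`). The application is submitted once, stays in the RUNNING state, and counts the words of every new file that appears in a watched directory. The script name, the function names, the 5-second batch interval, and the choice of directory are my assumptions for illustration, not something taken from the question.

```python
import sys


def split_words(line):
    """Split one line of text into whitespace-separated words."""
    return line.split()


def run_streaming_word_count(input_dir, batch_seconds=5):
    """Start a long-running job that counts words in files added to input_dir.

    pyspark is imported here so that split_words() can be used (and tested)
    on a machine without Spark installed.
    """
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="StreamingWordCount")
    # Assumption: one micro-batch every batch_seconds seconds.
    ssc = StreamingContext(sc, batch_seconds)

    # Each batch contains the lines of files that newly appeared in input_dir.
    lines = ssc.textFileStream(input_dir)
    counts = (lines.flatMap(split_words)
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # Print the first few (word, count) pairs of every batch to the driver log.
    counts.pprint()

    ssc.start()             # start receiving and processing data
    ssc.awaitTermination()  # block until the job is stopped or killed


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: streaming_word_count.py <input-dir>", file=sys.stderr)
    else:
        run_streaming_word_count(sys.argv[1])
```

Submitted once with something like `spark-submit --master yarn --deploy-mode cluster streaming_word_count.py hdfs://hadoop-master:9000/input/` (the script name is hypothetical), the job pays the ACCEPTED and startup cost a single time. After that, each file moved into the directory is picked up in the next batch. Two caveats: `textFileStream` only sees files that appear after the job has started, so files should be written elsewhere and then moved or renamed into the watched directory, and in cluster deploy mode the `pprint()` output lands in the driver's container log rather than in the submitting terminal.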