Tags: apache-spark, pyspark, hadoop-yarn

Spark output when submitting via Yarn cluster vs. client


I am new to Spark and just got it running on my cluster (Spark 2.0.1 on a 9-node cluster running the Community version of MapR). I submit the wordcount example via

./bin/spark-submit --master yarn --jars ~/hadoopPERMA/jars/hadoop-lzo-0.4.21-SNAPSHOT.jar examples/src/main/python/wordcount.py ./README.md

and get the following output:

17/04/07 13:21:34 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
: 68
help: 1
when: 1
Hadoop: 3
...

Looks like everything is working properly. When I add --deploy-mode cluster I get the following output:

17/04/07 13:23:52 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

So there are no errors, but I am not seeing the wordcount results. What am I missing? I see the job in my History Server and it says it completed successfully. I also checked my user directory in DFS, but no new files were written except for this empty directory: /user/myuser/.sparkStaging

Code (wordcount.py example shipped with Spark):

from __future__ import print_function
import sys
from operator import add
from pyspark.sql import SparkSession


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        exit(-1)

    spark = SparkSession\
        .builder\
        .appName("PythonWordCount")\
        .getOrCreate()

    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
    counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))

    spark.stop()

Solution

  • The reason your output is not printed:

    When you run in yarn-client mode, the node from which you submit the job is the driver. When you call collect(), the results are gathered on that node and you print them there, which is why you see them.

    In yarn-cluster mode the driver runs on some other node in the cluster, not the one from which you submitted the job. So when you call collect(), the results are gathered and printed on that driver node instead. You can find them in the driver's stdout (for example via yarn logs -applicationId <application id>). A better approach is to write the output somewhere in HDFS rather than printing it from the driver, as in the sketch below.
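
    A minimal sketch of that approach, based on the wordcount.py above: it takes a second argument for an output directory (a hypothetical HDFS path of your choosing) and uses saveAsTextFile instead of collect/print, so the results end up in the same place regardless of deploy mode.

    from __future__ import print_function
    import sys
    from operator import add
    from pyspark.sql import SparkSession


    if __name__ == "__main__":
        if len(sys.argv) != 3:
            print("Usage: wordcount <input> <output_dir>", file=sys.stderr)
            sys.exit(-1)

        spark = SparkSession\
            .builder\
            .appName("PythonWordCountToHDFS")\
            .getOrCreate()

        lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])
        counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)

        # Write the counts to the output directory (e.g. hdfs:///user/myuser/wc-out)
        # instead of collecting them to the driver; the directory must not exist yet.
        counts.map(lambda wc: "%s: %i" % (wc[0], wc[1])) \
              .saveAsTextFile(sys.argv[2])

        spark.stop()

    You can then inspect the result from any node with hadoop fs -cat on that output directory, whichever deploy mode you used.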

    The reason for the spark.yarn.jars warning:

    To run a Spark job, YARN needs the Spark runtime jars available on all nodes of the cluster. If they are not available, then as part of job preparation Spark creates a zip file with all the jars under $SPARK_HOME/jars and uploads it to the distributed cache.

    To solve this:

    By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable (chmod 777) location on HDFS. This allows YARN to cache them on the nodes so that they don't need to be distributed each time an application runs. To point to jars on HDFS, for example, set spark.yarn.jars to hdfs:///some/path.
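
    For example, assuming you choose hdfs:///some/path (the placeholder path from the docs) as that location, a rough one-time setup could be:

    hadoop fs -mkdir -p /some/path
    hadoop fs -put $SPARK_HOME/jars/*.jar /some/path/
    hadoop fs -chmod -R 777 /some/path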

    After placing your jars there, run your job like this (note that --conf must come before the application script, otherwise it is passed as an argument to wordcount.py):

    ./bin/spark-submit --master yarn --conf spark.yarn.jars="hdfs:///some/path" --jars ~/hadoopPERMA/jars/hadoop-lzo-0.4.21-SNAPSHOT.jar examples/src/main/python/wordcount.py ./README.md
    

    Source: http://spark.apache.org/docs/latest/running-on-yarn.html