Tags: hadoop, apache-spark, hadoop-yarn, oozie, hue

Unable to run spark job in HUE oozie. Exception: "datanucleus-api-jdo-3.2.1.jar does not exist"


What I want to know is how and where Spark picks up the jars it needs.

file:/mnt/md0/yarn/nm/usercache/kylin/appcache/application_1468506830246_161908/container_1468506830246_161908_01_000001/datanucleus-api-jdo-3.2.1.jar does not exist.

<spark-opts>
  --num-executors 30 
  --executor-memory 18g 
  --executor-cores 15 
  --driver-memory 2g 
  --files hdfs:///jobs/kylin/hive-site.xml 
  --jars datanucleus-api-jdo-3.2.1.jar,datanucleus-rdbms-3.2.1.jar,datanucleus-core-3.2.2.jar 
  --conf spark.shuffle.manager=tungsten-sort 
  --conf spark.shuffle.consolidateFiles=true 
  --conf spark.yarn.executor.memoryOverhead=3072 
  --conf spark.shuffle.memoryFraction=0.7 
  --conf spark.storage.memoryFraction=0.05 
  --conf spark.spot.instances=30
</spark-opts>

Solution

  • We need to provide an absolute path (or a fully qualified URI) to the jars; otherwise the job will fail. A corrected <spark-opts> sketch is shown at the end of this answer.

    Please check the details below, taken from the Spark documentation, on the different ways to supply --jars.

    When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. URLs supplied after --jars must be separated by commas. That list is included on the driver and executor classpaths. Directory expansion does not work with --jars.

    Spark uses the following URL scheme to allow different strategies for disseminating jars:

    file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.

    hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected

    local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.

    Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes. This can use up a significant amount of space over time and will need to be cleaned up. With YARN, cleanup is handled automatically, and with Spark standalone, automatic cleanup can be configured with the spark.worker.cleanup.appDataTtl property.

    Users may also include any other dependencies by supplying a comma-delimited list of maven coordinates with --packages. All transitive dependencies will be handled when using this command. Additional repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the flag --repositories. These commands can be used with pyspark, spark-shell, and spark-submit to include Spark Packages.

    For Python, the equivalent --py-files option can be used to distribute .egg, .zip and .py libraries to executors.

    For more information, see the Spark documentation on submitting applications (Advanced Dependency Management). The sketches below illustrate these options.
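
A corrected <spark-opts> sketch, assuming the three DataNucleus jars have been uploaded to HDFS next to hive-site.xml; the hdfs:///jobs/kylin/ location is only an assumption here, so point the URIs at wherever the jars actually live:

<!-- assumption: DataNucleus jars uploaded to hdfs:///jobs/kylin/ ; adjust the URIs to the real location -->
<spark-opts>
  --num-executors 30
  --executor-memory 18g
  --executor-cores 15
  --driver-memory 2g
  --files hdfs:///jobs/kylin/hive-site.xml
  --jars hdfs:///jobs/kylin/datanucleus-api-jdo-3.2.1.jar,hdfs:///jobs/kylin/datanucleus-rdbms-3.2.1.jar,hdfs:///jobs/kylin/datanucleus-core-3.2.2.jar
  --conf spark.shuffle.manager=tungsten-sort
  --conf spark.shuffle.consolidateFiles=true
  --conf spark.yarn.executor.memoryOverhead=3072
  --conf spark.shuffle.memoryFraction=0.7
  --conf spark.storage.memoryFraction=0.05
  --conf spark.spot.instances=30
</spark-opts>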
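
A spark-submit sketch of the URL schemes described above; the class name, application jar and all paths are illustrative, not taken from the original job:

# Illustrative only: each --jars entry shows one scheme.
#   hdfs:  the jar is pulled from HDFS
#   file:  an absolute path on the submitting machine, distributed by the driver
#   local: the jar is expected to already exist at this path on every worker node
spark-submit \
  --master yarn \
  --class com.example.MyApp \
  --jars hdfs:///jobs/kylin/datanucleus-api-jdo-3.2.1.jar,file:/opt/hive/lib/datanucleus-rdbms-3.2.1.jar,local:/opt/hive/lib/datanucleus-core-3.2.2.jar \
  my-app.jar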
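
Similarly, a sketch of --packages, --repositories and --py-files; the Maven coordinate, repository URL and file names are only examples:

# Illustrative only: pull spark-csv and its transitive dependencies by Maven
# coordinates, add an extra resolver, and ship Python code to the executors.
spark-submit \
  --master yarn \
  --packages com.databricks:spark-csv_2.10:1.5.0 \
  --repositories https://repos.example.com/maven \
  --py-files deps.zip,helpers.py \
  my_job.py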