hadoop, apache-spark, hadoop-yarn, hadoop2

Correct way of submitting a job to a YARN cluster when the job depends on external jars?


I am trying to understand the correct way of submitting an MR (or, for that matter, a Spark-based Java) job to a YARN cluster.

Consider the situation below:

A developer writes code (MR or Spark jobs) on a client machine, and the code uses 3rd-party jars. When the developer submits the job to the YARN cluster, what is the correct way to do so such that there is no class-not-found exception at run time? Since the job is submitted as a jar file, how can a developer "put" in the 3rd-party jars?

I am having difficulty understanding this; can anyone help me?


Solution

  • You simply have to build a "fat jar," with Gradle or Maven, that contains not only your compiled code but also all of its transitive dependencies.

    You can use either the Maven Assembly Plugin or a Gradle plugin such as the Shadow Plugin; a sketch of the Gradle route follows below.

    The output of either plugin is the jar you should supply to spark-submit.
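    As a minimal sketch of the Gradle route, assuming the Shadow Plugin (the plugin version, the Spark coordinates, and the Gson dependency below are illustrative, not prescriptive; the Maven Assembly Plugin's jar-with-dependencies descriptor achieves the same result):

        // build.gradle.kts -- sketch of a Shadow Plugin setup
        plugins {
            java
            id("com.github.johnrengelman.shadow") version "8.1.1"
        }

        repositories {
            mavenCentral()
        }

        dependencies {
            // Spark (and Hadoop) classes are already present on the YARN
            // cluster, so keep them out of the fat jar with compileOnly.
            compileOnly("org.apache.spark:spark-core_2.12:3.5.0")

            // 3rd-party libraries the job actually needs get bundled.
            implementation("com.google.code.gson:gson:2.10.1")
        }

        tasks.shadowJar {
            // Distinguish the fat jar from the plain jar.
            archiveClassifier.set("all")
        }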
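    Building and submitting then looks like this (the main class and jar name are hypothetical placeholders for your own):

        ./gradlew shadowJar

        spark-submit \
          --master yarn \
          --deploy-mode cluster \
          --class com.example.MyJob \
          build/libs/my-job-1.0-all.jar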