Tags: apache-spark, jar, classpath

What is the use of the addJar() method in Spark?


In a Spark job, I don't know how to import and use the jars that are distributed with the method SparkContext.addJar(). It seems that this method moves the jars to some location that is accessible by the other nodes in the cluster, but I do not know how to import them.
This is an example:

package utils;
    
public class addNumber {
  public int addOne(int i) {
    return i + 1;
  }

  public int addTwo(int i) {
    return i + 2;
  }
}

I create a class called addNumber and package it into a jar file, utils.jar.
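
For example, the jar can be produced with something like the following (assuming the class is saved as utils/addNumber.java; exact paths are illustrative):

javac utils/addNumber.java              # produces utils/addNumber.class next to the source
jar cf utils.jar utils/addNumber.class  # packages the class under its utils/ package path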

Then I create a Spark job with the code below:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
    
object TestDependencies {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)
    sc.addJar("/path/to/utils.jar")

    val data = (1 to 100).toList
    val rdd = sc.makeRDD(data)
        
    val rdd_1 = rdd.map(x => {
      val handler = new utils.addNumber
      handler.addOne(x)
    })
        
    rdd_1.collect().foreach { x => print(x + "||") }
  }
}

After submitting the job through spark-submit, the error java.lang.NoClassDefFoundError: utils/addNumber was raised.

I know that addJar() does not guarantee that the jars are included in the classpath of the Spark job. If I want to use the jar files, I have to copy all of the dependencies to the same path on every node of the cluster. But if I have to copy and include all of the jars myself anyway, what is the use of addJar()?

I am wondering whether there is a way to use the jars added with addJar(). Thanks in advance.


Solution

  • Did you try setting the path of the jar with the local: prefix? From the documentation:

    public void addJar(String path)
    

    Adds a JAR dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node.
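
    For example, a minimal sketch assuming utils.jar has already been copied to /opt/jars/utils.jar (an illustrative path) on every worker node:

    // "local:" tells Spark the jar already exists at this path on each node,
    // so it only has to add the jar to the tasks' classpath instead of shipping it.
    sc.addJar("local:/opt/jars/utils.jar")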

    You can try with setJars as well:

    val conf = new SparkConf()
                 .setMaster("local[*]")
                 .setAppName("tmp")
                 .setJars(Array("/path1/one.jar", "/path2/two.jar"))

    val sc = new SparkContext(conf)
    

    and take a look at the Spark configuration documentation, in particular the spark.jars option,

    and set the --jars parameter in spark-submit:

    --jars /path/1.jar,/path/2.jar
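
    For example, a complete spark-submit invocation might look like this (the class name matches the question; the master URL and jar paths are illustrative):

    spark-submit \
      --class TestDependencies \
      --master spark://master:7077 \
      --jars /path/to/utils.jar \
      test-dependencies.jar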
    

    or edit conf/spark-defaults.conf:

    spark.driver.extraClassPath /path/1.jar:/fullpath/2.jar
    spark.executor.extraClassPath /path/1.jar:/fullpath/2.jar