Tags: java, apache-spark, pyspark, spark-launcher

What is --archives for SparkLauncher in Java?


I am going to submit a PySpark task along with an environment for it.

I need --archives to ship the zip package that contains the full environment.

The working spark-submit command is this:

/my/spark/home/spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 10G \
--executor-memory 8G \
--executor-cores 4 \
--queue rnd \
--num-executors 8 \
--archives /data/me/ld_env.zip#prediction_env \
--conf spark.pyspark.python=./prediction_env/ld_env/bin/python \
--conf spark.pyspark.driver.python=./prediction_env/ld_env/bin/python \
--conf spark.executor.memoryOverhead=4096 \
--py-files dist/mylib-0.1.0-py3-none-any.whl my_task.py

I am trying to start the Spark app programmatically with SparkLauncher:

String pyPath = "my_task.py";
String archives = "/data/me/ld_env.zip#prediction_env";
SparkAppHandle handle = new SparkLauncher()
        .setSparkHome(sparkHome)
        .setAppResource(jarPath)
        .setMaster("yarn")
        .setDeployMode("cluster")
        .setConf(SparkLauncher.EXECUTOR_MEMORY, "8G")
        .setConf(SparkLauncher.EXECUTOR_CORES, "2")
        .setConf("spark.executor.instances", "8")
        .setConf("spark.yarn.queue", "rnd")
        .setConf("spark.pyspark.python", "./prediction_env/ld_env/bin/python")
        .setConf("spark.pyspark.driver.python", "./prediction_env/ld_env/bin/python")
        .setConf("spark.executor.memoryOverhead", "4096")
        .addPyFile(pyPath)
        // .addPyFile(archives) 
        // .addFile(archives)
        .addAppArgs("--inputPath",
                inputPath,
                "--outputPath",
                outputPath,
                "--option",
                option)
        .startApplication(taskListener);

I need somewhere to pass my zip file so that it gets unpacked on YARN, but I don't see any archives method on SparkLauncher.


Solution

  • Use the config spark.yarn.dist.archives, as documented in Running on YARN and the tutorial:

    String pyPath = "my_task.py";
    String archives = "/data/me/ld_env.zip#prediction_env";
    SparkAppHandle handle = new SparkLauncher()
            .setSparkHome(sparkHome)
            .setAppResource(jarPath)
            .setMaster("yarn")
            .setDeployMode("cluster")
            .setConf(SparkLauncher.EXECUTOR_MEMORY, "8G")
            .setConf(SparkLauncher.EXECUTOR_CORES, "2")
            .setConf("spark.executor.instances", "8")
            .setConf("spark.yarn.queue", "rnd")
            .setConf("spark.pyspark.python", "./prediction_env/ld_env/bin/python")
            .setConf("spark.pyspark.driver.python", "./prediction_env/ld_env/bin/python")
            .setConf("spark.executor.memoryOverhead", "4096")
            .setConf("spark.yarn.dist.archives", archives)
            .addPyFile(pyPath)
            .addAppArgs("--inputPath",
                    inputPath,
                    "--outputPath",
                    outputPath,
                    "--option",
                    option)
            .startApplication(taskListener);
    

    So, adding .setConf("spark.yarn.dist.archives", archives) fixes the problem.
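
    As a side note, the #prediction_env fragment is the directory name the archive is unpacked under in each YARN container, which is why spark.pyspark.python points at ./prediction_env/ld_env/bin/python. If you would rather mirror the original spark-submit flags, SparkLauncher also exposes a generic addSparkArg(name, value) method that forwards spark-submit options; the following is only a minimal sketch, assuming your Spark version accepts --archives through that path (same placeholder paths and variables as above):

    // Hedged alternative: forward the --archives flag itself instead of
    // setting spark.yarn.dist.archives. Assumes --archives is accepted
    // by addSparkArg on your Spark version.
    SparkAppHandle handle = new SparkLauncher()
            .setSparkHome(sparkHome)
            .setAppResource(jarPath)
            .setMaster("yarn")
            .setDeployMode("cluster")
            .addSparkArg("--archives", "/data/me/ld_env.zip#prediction_env")
            // Python interpreters resolved relative to the unpacked archive
            .setConf("spark.pyspark.python", "./prediction_env/ld_env/bin/python")
            .setConf("spark.pyspark.driver.python", "./prediction_env/ld_env/bin/python")
            .addPyFile("my_task.py")
            .startApplication(taskListener);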