Tags: java, apache-spark, pyspark, spark-launcher

What is --archives for SparkLauncher in Java?


I am going to submit a PySpark task along with an environment for it.

I need --archives to ship the zip package that contains the full environment.

The working spark-submit command is this:

/my/spark/home/spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 10G \
--executor-memory 8G \
--executor-cores 4 \
--queue rnd \
--num-executors 8 \
--archives /data/me/ld_env.zip#prediction_env \
--conf spark.pyspark.python=./prediction_env/ld_env/bin/python \
--conf spark.pyspark.driver.python=./prediction_env/ld_env/bin/python \
--conf spark.executor.memoryOverhead=4096 \
--py-files dist/mylib-0.1.0-py3-none-any.whl my_task.py

I am trying to start the Spark app programmatically with SparkLauncher:

String pyPath = "my_task.py";
String archives = "/data/me/ld_env.zip#prediction_env";
SparkAppHandle handle = new SparkLauncher()
        .setSparkHome(sparkHome)
        .setAppResource(jarPath)
        .setMaster("yarn")
        .setDeployMode("cluster")
        .setConf(SparkLauncher.EXECUTOR_MEMORY, "8G")
        .setConf(SparkLauncher.EXECUTOR_CORES, "2")
        .setConf("spark.executor.instances", "8")
        .setConf("spark.yarn.queue", "rnd")
        .setConf("spark.pyspark.python", "./prediction_env/ld_env/bin/python")
        .setConf("spark.pyspark.driver.python", "./prediction_env/ld_env/bin/python")
        .setConf("spark.executor.memoryOverhead", "4096")
        .addPyFile(pyPath)
        // .addPyFile(archives) 
        // .addFile(archives)
        .addAppArgs("--inputPath",
                inputPath,
                "--outputPath",
                outputPath,
                "--option",
                option)
        .startApplication(taskListener);

I need somewhere to pass my zip file so that it gets unpacked on YARN, but I don't see any archives method on SparkLauncher.


Solution

  • Use the config spark.yarn.dist.archives, as documented in Running on YARN and the tutorial:

    String pyPath = "my_task.py";
    String archives = "/data/me/ld_env.zip#prediction_env";
    SparkAppHandle handle = new SparkLauncher()
            .setSparkHome(sparkHome)
            .setAppResource(jarPath)
            .setMaster("yarn")
            .setDeployMode("cluster")
            .setConf(SparkLauncher.EXECUTOR_MEMORY, "8G")
            .setConf(SparkLauncher.EXECUTOR_CORES, "2")
            .setConf("spark.executor.instances", "8")
            .setConf("spark.yarn.queue", "rnd")
            .setConf("spark.pyspark.python", "./prediction_env/ld_env/bin/python")
            .setConf("spark.pyspark.driver.python", "./prediction_env/ld_env/bin/python")
            .setConf("spark.executor.memoryOverhead", "4096")
            .setConf("spark.yarn.dist.archives", archives)
            .addPyFile(pyPath)
            .addAppArgs("--inputPath",
                    inputPath,
                    "--outputPath",
                    outputPath,
                    "--option",
                    option)
            .startApplication(taskListener);
    

    So, adding .setConf("spark.yarn.dist.archives", archives) fixes the problem.
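
    As a side note, the #prediction_env fragment is the directory name the archive is unpacked under in each YARN container, which is why spark.pyspark.python points at ./prediction_env/ld_env/bin/python. If you would rather mirror the original spark-submit flags, SparkLauncher also exposes a generic addSparkArg(name, value) method that forwards spark-submit options; the following is only a minimal sketch, assuming your Spark version accepts --archives through that path (same placeholder paths and variables as above):

    // Hedged alternative: forward the --archives flag itself instead of
    // setting spark.yarn.dist.archives. Assumes --archives is accepted
    // by addSparkArg on your Spark version.
    SparkAppHandle handle = new SparkLauncher()
            .setSparkHome(sparkHome)
            .setAppResource(jarPath)
            .setMaster("yarn")
            .setDeployMode("cluster")
            .addSparkArg("--archives", "/data/me/ld_env.zip#prediction_env")
            // Python interpreters resolved relative to the unpacked archive
            .setConf("spark.pyspark.python", "./prediction_env/ld_env/bin/python")
            .setConf("spark.pyspark.driver.python", "./prediction_env/ld_env/bin/python")
            .addPyFile("my_task.py")
            .startApplication(taskListener);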