apache-spark, google-cloud-platform, pyspark, google-cloud-dataproc

How to pass Spark parameters to a Dataproc workflow template?


Here's what I have:

gcloud dataproc workflow-templates create $TEMPLATE_ID --region $REGION

gcloud beta dataproc workflow-templates set-managed-cluster $TEMPLATE_ID --region $REGION \
--cluster-name dailyhourlygtp --image-version 1.5 \
--master-machine-type=n1-standard-8 --worker-machine-type=n1-standard-16 --num-workers=10 \
--master-boot-disk-size=500 --worker-boot-disk-size=500 --zone=europe-west1-b


export STEP_ID=step_pyspark1

gcloud dataproc workflow-templates add-job pyspark \
gs://$BUCKET_NAME/my_pyscript.py \
--step-id $STEP_ID \
--workflow-template $TEMPLATE_ID \
--region $REGION \
--jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
--initialization-actions gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
--properties spark.jars.packages=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar

gcloud dataproc workflow-templates instantiate $TEMPLATE_ID --region=$REGION

So the question is: how do I pass the following Spark parameters to my_pyscript.py:

--master yarn --deploy-mode cluster \
--conf "spark.sql.shuffle.partitions=900" \
--conf "spark.sql.autoBroadcastJoinThreshold=10485760" \
--conf "spark.executor.memoryOverhead=8192" \
--conf "spark.dynamicAllocation.enabled=true" \
--conf "spark.shuffle.service.enabled=true" \
--executor-cores 5 --executor-memory 15g --driver-memory 16g

Solution

  • This is described in the documentation for gcloud dataproc workflow-templates add-job pyspark:

    --properties=[PROPERTY=VALUE,…]
    

    List of key value pairs to configure PySpark. For a list of available properties, see: https://spark.apache.org/docs/latest/configuration.html#available-properties.

    So you would do the same as when you submit a PySpark job to a Dataproc cluster without a template: the Spark properties for the submitted job are passed as a list of key=value pairs via the --properties flag (see the equivalent direct submission sketched after the example below).

    If your Python job expects arguments, you specify them after the positional -- separator, separated by spaces.

    For your example, it can be done like this:

    gcloud dataproc workflow-templates add-job pyspark \
    gs://$BUCKET_NAME/my_pyscript.py \
    --step-id $STEP_ID \
    --workflow-template $TEMPLATE_ID \
    --region $REGION \
    --properties="spark.submit.deployMode"="cluster",\
    "park.sql.shuffle.partitions"="900",\
    "spark.sql.autoBroadcastJoinThreshold"="10485760",\
    "spark.executor.memoryOverhead"="8192",\
    "spark.dynamicAllocation.enabled"="true",\
    "spark.shuffle.service.enabled"="true",\
    "spark.executor.memory"="15g",\
    "spark.driver.memory"="16g",\
    "spark.executor.cores"="5" \
    -- arg1 arg2 # for named args: -- -arg1 arg1 -arg2 arg2
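
    For comparison, here is a minimal sketch of the same job submitted directly, without a workflow template, as mentioned above. The cluster name my-existing-cluster is a placeholder; the properties and the trailing job arguments are passed in exactly the same way:

    # Assumes a cluster named "my-existing-cluster" (placeholder) already exists in $REGION.
    gcloud dataproc jobs submit pyspark \
    gs://$BUCKET_NAME/my_pyscript.py \
    --cluster=my-existing-cluster \
    --region=$REGION \
    --properties="spark.submit.deployMode"="cluster",\
    "spark.sql.shuffle.partitions"="900",\
    "spark.executor.memory"="15g",\
    "spark.driver.memory"="16g",\
    "spark.executor.cores"="5" \
    -- arg1 arg2

    Note that --master yarn does not need to be translated into a property: Dataproc jobs run on YARN by default.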