Tags: apache-spark, airflow, google-cloud-dataproc, airflow-scheduler, google-cloud-composer

How to pass Spark job properties to DataProcSparkOperator in Airflow?


I am trying to execute a Spark jar on Dataproc using Airflow's DataProcSparkOperator. The jar is located on GCS; I create a Dataproc cluster on the fly and then run the jar on the newly created cluster.

I am able to execute this with Airflow's DataProcSparkOperator using the default settings, but I am not able to configure Spark job properties (e.g. --master, --deploy-mode, --driver-memory, etc.). The Airflow documentation didn't help, and the things I have tried so far haven't worked out. Help is appreciated.


Solution

  • To configure a Spark job through DataProcSparkOperator, you need to use the dataproc_spark_properties parameter.

    For example, you can set deployMode like this:

    DataProcSparkOperator(
        dataproc_spark_properties={'spark.submit.deployMode': 'cluster'})
    

    You can find more details in this answer.
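
    For a fuller picture, here is a minimal sketch of a DAG task (assuming Airflow 1.x with the contrib operator; the jar path, cluster name, region and memory values below are placeholders). The spark-submit flags from the question map onto Spark configuration properties passed via dataproc_spark_properties; --master usually isn't needed, because Dataproc runs jobs on YARN and sets the master itself:

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.dataproc_operator import DataProcSparkOperator

    dag = DAG(
        dag_id='dataproc_spark_example',
        start_date=datetime(2019, 1, 1),
        schedule_interval=None,
    )

    submit_spark_job = DataProcSparkOperator(
        task_id='submit_spark_job',
        main_jar='gs://my-bucket/jars/my-spark-job.jar',  # placeholder GCS path to the jar
        cluster_name='my-dataproc-cluster',               # placeholder; cluster created earlier in the DAG
        region='us-central1',
        # spark-submit flags expressed as Spark properties:
        #   --deploy-mode cluster -> spark.submit.deployMode
        #   --driver-memory 4g    -> spark.driver.memory
        #   --executor-memory 4g  -> spark.executor.memory
        dataproc_spark_properties={
            'spark.submit.deployMode': 'cluster',
            'spark.driver.memory': '4g',
            'spark.executor.memory': '4g',
        },
        dag=dag,
    )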