apache-spark, google-cloud-platform, pyspark, airflow, google-cloud-dataproc

How to trigger a Google Dataproc job using Airflow and pass parameters to it


As part of a DAG, I am triggering a GCP PySpark Dataproc job using the code below:

    job = DataProcPySparkOperator(
        dag=dag,
        gcp_conn_id=gcp_conn_id,
        region=region,
        main=pyspark_script_location_gcs,
        task_id='pyspark_job_1_submit',
        cluster_name=cluster_name,
        job_name="job_1"
    )

How can I pass a variable as a parameter to the PySpark job so that it is accessible in the script?


Solution

  • You can use the arguments parameter of DataProcPySparkOperator:

    arguments (list) – Arguments for the job. (templated)

    from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

    job = DataProcPySparkOperator(
        gcp_conn_id=gcp_conn_id,
        region=region,
        main=pyspark_script_location_gcs,
        task_id='pyspark_job_1_submit',
        cluster_name=cluster_name,
        job_name="job_1",
        arguments=[
            "-arg1=arg1_value", # or just "arg1_value" for non named args
            "-arg2=arg2_value"
        ],
        dag=dag
    )
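
    Since arguments is templated, Airflow template values such as "{{ ds }}" can also be passed in the list. On the PySpark side, the values arrive as plain command-line arguments to the script. Below is a minimal sketch of how the script might read them, assuming the "-argN=value" style used above (the parse_args helper is only illustrative):

        import sys

        from pyspark.sql import SparkSession

        def parse_args(argv):
            # Turn ["-arg1=foo", "-arg2=bar"] into {"arg1": "foo", "arg2": "bar"}
            parsed = {}
            for item in argv:
                key, _, value = item.lstrip("-").partition("=")
                parsed[key] = value
            return parsed

        if __name__ == "__main__":
            args = parse_args(sys.argv[1:])  # sys.argv[0] is the script path itself
            spark = SparkSession.builder.appName("job_1").getOrCreate()
            print("arg1 =", args.get("arg1"), "arg2 =", args.get("arg2"))
            spark.stop()

    If the values are passed positionally instead (just "arg1_value"), they can be read directly from sys.argv[1], sys.argv[2], …, or argparse can be used for something more robust.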