Tags: python, apache-spark, airflow, google-cloud-dataproc

AttributeError trying to load a DAG with DataProcSparkOperator tasks


I've written a DAG to execute a number of Spark tasks on a Dataproc cluster. The DAG has worked without alteration in the past, but I have since had to delete and reinstall Airflow. Now, when starting the webserver, I receive the following error:

AttributeError: 'DataProcSparkOperator' object has no attribute 'dataproc_spark_jars'

Documentation suggests that this object does indeed have this attribute (I can attest, since this used to work fine), and I am not sure what I need to do to get it working again.

Here's one of the tasks:

# Assumes the Airflow 1.x contrib module is imported elsewhere in the DAG file as:
#   from airflow.contrib.operators import dataproc_operator as dpo
# main_class, main_jar, conf_dest, conf_name and date_param are defined earlier in the DAG.
run_spark_job = dpo.DataProcSparkOperator(
    task_id='run_spark_job',
    main_class=main_class,
    dataproc_spark_jars=[main_jar],
    arguments=['--prop-file', '{}/{}'.format(conf_dest, conf_name), '-d', '{}'.format(date_param)],
)

Solution

  • There seems to be an issue with the version of Airflow currently published on PyPI: in the latest version of dataproc_operator.py on Airflow's GitHub, the dataproc_spark_jars attribute has been removed and replaced with dataproc_jars.

    It's a bit ham-fisted, but I copied that version of dataproc_operator.py over my local copy, and my issue was resolved (after renaming the attribute in my DAG, of course; see the sketch below).
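
    For illustration, this is roughly how the task from the question would look after the rename, assuming the same dpo alias and the same variables (main_class, main_jar, conf_dest, conf_name, date_param) as above; the only change is the keyword argument name:

    # Sketch of the updated task under the assumptions stated above.
    # The only change from the original is dataproc_spark_jars -> dataproc_jars.
    run_spark_job = dpo.DataProcSparkOperator(
        task_id='run_spark_job',
        main_class=main_class,
        dataproc_jars=[main_jar],
        arguments=['--prop-file', '{}/{}'.format(conf_dest, conf_name), '-d', '{}'.format(date_param)],
    )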