I've written a DAG to execute a number of Spark tasks on a Dataproc cluster. This DAG has worked without alteration in the past, but I have since had to delete and reinstall Airflow. Now, when starting the webserver I receive the following error:
AttributeError: 'DataProcSparkOperator' object has no attribute 'dataproc_spark_jars'
The documentation suggests that this object does indeed have this attribute (and I can attest that it does, since this used to work fine), and I am not sure what I need to do to get it working again.
Here's one of the tasks:
from airflow.contrib.operators import dataproc_operator as dpo

run_spark_job = dpo.DataProcSparkOperator(
    task_id='run_spark_job',
    main_class=main_class,
    dataproc_spark_jars=[main_jar],
    arguments=['--prop-file', '{}/{}'.format(conf_dest, conf_name),
               '-d', '{}'.format(date_param)],
)
There seems to be an issue with the current live version of Airflow on PyPI: on Airflow's GitHub, the latest version of dataproc_operator.py has removed the dataproc_spark_jars attribute and replaced it with dataproc_jars.
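For reference, here is a minimal sketch of the task from the question updated for the renamed parameter (main_class, main_jar, conf_dest, conf_name, and date_param are assumed to be defined elsewhere in the DAG, as in the original):

from airflow.contrib.operators import dataproc_operator as dpo

run_spark_job = dpo.DataProcSparkOperator(
    task_id='run_spark_job',
    main_class=main_class,
    dataproc_jars=[main_jar],  # renamed from dataproc_spark_jars
    arguments=['--prop-file', '{}/{}'.format(conf_dest, conf_name),
               '-d', '{}'.format(date_param)],
)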
It's a bit ham-fisted, but I copied that version of dataproc_operator.py over my local copy, and my issue is resolved (after renaming the attribute in my DAG, of course).