Tags: airflow, airflow-scheduler, airflow-2.x, mwaa

MWAA in production - tasks queued for unknown reasons


Does anyone use MWAA in production?

We currently have around 500 DAGs running, and we are seeing unexpected behavior: tasks stay in a "queued" state for no apparent reason. Each stuck task shows:

    Task is in the 'queued' state which is not a valid state for execution. The task must be cleared in order to be run.

It happens randomly: everything can run perfectly for a day, and then a few tasks will stay queued. They remain in this state indefinitely unless we manually mark them as failed.
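
For context, here is a minimal sketch of how we spot the stuck tasks (assuming Airflow 2.x and ORM access to the metadata database, e.g. from a diagnostic task; the one-hour cutoff is an arbitrary choice of ours):

    from datetime import datetime, timedelta, timezone

    from airflow.models import TaskInstance
    from airflow.utils.session import create_session
    from airflow.utils.state import State

    # Consider a task "stuck" if it has been queued for more than an hour.
    cutoff = datetime.now(timezone.utc) - timedelta(hours=1)

    with create_session() as session:
        stuck = (
            session.query(TaskInstance)
            .filter(TaskInstance.state == State.QUEUED)
            .filter(TaskInstance.queued_dttm < cutoff)
            .all()
        )
        for ti in stuck:
            print(ti.dag_id, ti.task_id, ti.queued_dttm)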

A DAG run can stay in this "queued" state even when the pool is empty; I can't find any reason that explains it.

It happens to ~5% of the tasks, with all the others running smoothly.

Have you ever encountered this behavior?


Solution

  • This was happening to me in MWAA as well. The fix recommended to me by AWS was to add the following Airflow configuration options to the environment via the web UI (a programmatic sketch follows the list):

    celery.sync_parallelism = 1                    # processes used to sync Celery task state (1 = serial)
    core.dag_file_processor_timeout = 150          # seconds before a DAG file processor times out
    core.dagbag_import_timeout = 90                # seconds allowed for importing a DAG file
    core.min_serialized_dag_update_interval = 300  # min seconds between serialized-DAG DB updates
    scheduler.dag_dir_list_interval = 600          # seconds between scans of the DAG directory
    scheduler.min_file_process_interval = 300      # min seconds between re-parses of a DAG file
    scheduler.parsing_processes = 2                # parallel processes for DAG parsing
    scheduler.processor_poll_interval = 60         # seconds the scheduler sleeps between loops
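
    If you manage the environment outside the console, the same options can be applied programmatically. A minimal sketch using boto3's MWAA client (the environment name "my-mwaa-env" is a placeholder):

        import boto3

        mwaa = boto3.client("mwaa")

        # Note: update_environment triggers an environment update, which
        # can take a while to apply. Option values must be strings.
        mwaa.update_environment(
            Name="my-mwaa-env",  # placeholder environment name
            AirflowConfigurationOptions={
                "celery.sync_parallelism": "1",
                "core.dag_file_processor_timeout": "150",
                "core.dagbag_import_timeout": "90",
                "core.min_serialized_dag_update_interval": "300",
                "scheduler.dag_dir_list_interval": "600",
                "scheduler.min_file_process_interval": "300",
                "scheduler.parsing_processes": "2",
                "scheduler.processor_poll_interval": "60",
            },
        )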