Does anyone use MWAA in production?
We currently have around 500 DAGs running, and we are seeing unexpected behavior where tasks get stuck in the "queued" state for no apparent reason, with the message:
Task is in the 'queued' state which is not a valid state for execution. The task must be cleared in order to be run.
It happens randomly: everything can run perfectly for a day, and then a few tasks will stay queued. They remain in this state forever unless we manually mark them as failed.
A DAG run can stay in this "queued" state even when the pool is empty; I don't see anything that explains it.
It happens to roughly 5% of the tasks, while all the others run smoothly.
Has anyone else encountered this behavior?
This was happening to me in MWAA as well. The fix, recommended to me by AWS, was to add the following Airflow configuration options via the web UI:
celery.sync_parallelism = 1                          # processes the Celery executor uses to sync task state
core.dag_file_processor_timeout = 150                # seconds before a DAG file processor times out
core.dagbag_import_timeout = 90                      # seconds before importing a DAG file times out
core.min_serialized_dag_update_interval = 300        # minimum seconds between serialized DAG updates in the metadata DB
scheduler.dag_dir_list_interval = 600                # seconds between scans of the DAGs folder for new files
scheduler.min_file_process_interval = 300            # minimum seconds before the same DAG file is re-parsed
scheduler.parsing_processes = 2                      # processes the scheduler uses to parse DAG files
scheduler.processor_poll_interval = 60               # seconds the scheduler sleeps between scheduling loops
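
In case it helps anyone applying this from code rather than the console: the same overrides can be pushed with boto3's MWAA client. A minimal sketch, assuming a hypothetical environment name and region (swap in your own):

import boto3

# Hypothetical region; use whatever your MWAA environment runs in.
mwaa = boto3.client("mwaa", region_name="us-east-1")

response = mwaa.update_environment(
    Name="my-mwaa-environment",  # hypothetical environment name
    AirflowConfigurationOptions={
        "celery.sync_parallelism": "1",
        "core.dag_file_processor_timeout": "150",
        "core.dagbag_import_timeout": "90",
        "core.min_serialized_dag_update_interval": "300",
        "scheduler.dag_dir_list_interval": "600",
        "scheduler.min_file_process_interval": "300",
        "scheduler.parsing_processes": "2",
        "scheduler.processor_poll_interval": "60",
    },
)
print(response["Arn"])  # the environment ARN is returned once the update request is accepted

Note that this triggers an environment update on the MWAA side, so the new options can take a while to actually apply.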