Tags: amazon-web-services, airflow, airflow-2.x, mwaa

Tasks in Airflow (MWAA) are going into failed state without ever running


In one of our DAGs, which has many tasks, a seemingly random set of tasks fails in each DAG run with the following error:

Task is in the 'failed' state which is not a valid state for execution. 
The task must be cleared in order to be run.

The task logs are empty, which suggests that the tasks never started running.

The same DAG runs in our production environment without an issue. But in DEV, the DAG fails.

Thinking it might be related to the size of the DEV environment, we updated it to match production (Class: mw1.large, Schedulers: 2, Max Worker Cnt: 5, Min Worker Cnt: 1). This did not help either.

We have looked at the scheduler logs and found nothing noteworthy.

What else could cause tasks to be marked as failed with the error message shown above, and how can we diagnose it?


Solution

  • A small load in DEV might cause issues with automatic scaling, as described at https://docs.aws.amazon.com/mwaa/latest/userguide/mwaa-autoscaling.html. Try setting Min Worker Cnt to the same value as Max Worker Cnt, which is 5. If that does not work, try setting the optional Airflow configuration option celery.worker_autoscale to 5,5. This can be done in the AWS console via the Edit button for your Airflow environment.

    For best practices see https://docs.aws.amazon.com/mwaa/latest/userguide/best-practices-tuning.html.
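
If you prefer to script the change rather than click through the console, the same settings can probably be applied with boto3. The following is only a sketch: the environment name "dev-env" and the helper names build_update_kwargs and apply_update are made up for illustration, and it assumes your credentials are allowed to call update_environment on the MWAA environment. It starts with only the worker counts and adds celery.worker_autoscale when you pass it explicitly, mirroring the two steps above.

```python
from typing import Any, Dict, Optional


def build_update_kwargs(
    env_name: str,
    workers: int,
    worker_autoscale: Optional[str] = None,
) -> Dict[str, Any]:
    """Build arguments that pin an MWAA environment to a fixed worker count.

    env_name is a hypothetical environment name supplied by the caller.
    worker_autoscale, if given, is passed through as celery.worker_autoscale
    (Airflow reads it as "max,min").
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    kwargs: Dict[str, Any] = {
        "Name": env_name,
        "MinWorkers": workers,  # same value as MaxWorkers, so no scale-down
        "MaxWorkers": workers,
    }
    if worker_autoscale is not None:
        # Assumption: verify whether options already set on the environment
        # must be included here too, so that they are not dropped.
        kwargs["AirflowConfigurationOptions"] = {
            "celery.worker_autoscale": worker_autoscale,
        }
    return kwargs


def apply_update(
    env_name: str,
    workers: int = 5,
    worker_autoscale: Optional[str] = None,
) -> str:
    """Send the update to MWAA and return the environment ARN."""
    import boto3  # imported here so the helper above works without boto3

    client = boto3.client("mwaa")
    response = client.update_environment(
        **build_update_kwargs(env_name, workers, worker_autoscale)
    )
    return response["Arn"]
```

A first attempt would be apply_update("dev-env", 5); if tasks still fail without running, apply_update("dev-env", 5, worker_autoscale="5,5") applies the second suggestion as well. The environment takes some time to update after either call.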