MySQL used in Airflow comes to a lot of deadlock

I am now encountering a lot of deadlocks when using Airflow with MySQL as the meta database, the innodb deadlock log is:

LATEST DETECTED DEADLOCK

191113 18:05:59
    (1) TRANSACTION:
TRANSACTION 5BDBA005, ACTIVE 0 sec fetching rows
mysql tables in use 2, locked 2
LOCK WAIT 20 lock struct(s), heap size 3112, 89 row lock(s)
MySQL thread id 108349930, OS thread handle 0x7f9a3aa68700, query id 5378923356 10.13.33.22 root Sending data
UPDATE task_instance, dag_run SET task_instance.state=NULL WHERE task_instance.dag_id = 'test_dag_01' AND task_instance.state IN ('queued', 'scheduled') AND dag_run.dag_id = task_instance.dag_id AND dag_run.execution_date = task_instance.execution_date AND dag_run.state != 'running'
    (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 503427 n bits 168 index `PRIMARY` of table `airflow`.`task_instance` trx id 5BDBA005 lock_mode X locks rec but not gap waiting
    (2) TRANSACTION:
TRANSACTION 5BDBA004, ACTIVE 0 sec updating or deleting
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1248, 2 row lock(s), undo log entries 1
MySQL thread id 108349929, OS thread handle 0x7f9a38cec700, query id 5378923359 10.13.33.22 root Updating
UPDATE task_instance SET state='queued', queued_dttm='2019-11-13 18:05:59.392656' WHERE task_instance.dag_id = 'test_dag_01' AND task_instance.task_id = 'test_task_01_03' AND task_instance.execution_date = '2019-11-08 00:00:00'
    (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 0 page no 503427 n bits 168 index `PRIMARY` of table `airflow`.`task_instance` trx id 5BDBA004 lock_mode X locks rec but not gap
    (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 417617 n bits 360 index `ti_dag_state` of table `airflow`.`task_instance` trx id 5BDBA004 lock_mode X locks rec but not gap waiting
    WE ROLL BACK TRANSACTION (2)

TRANSACTIONS

Trx id counter 5BEC503C
Purge done for trx's n:o < 5BEC4F55 undo n:o < 0
History list length 2924
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 5BEC5039, not started
MySQL thread id 108413843, OS thread handle 0x7f9a3a32e700, query id 5384428783 10.13.33.21 root
---TRANSACTION 5BEC503B, not started
MySQL thread id 108412780, OS thread handle 0x7f9a3a206700, query id 5384428790 10.13.33.20 root
---TRANSACTION 5BEC4D73, not started
MySQL thread id 108398996, OS thread handle 0x7f9a38b7a700, query id 5384425580 10.13.33.22 root
---TRANSACTION 5BE3DE1B, not started
MySQL thread id 108394511, OS thread handle 0x7f9a3af52700, query id 5381874736 10.13.33.20 root
---TRANSACTION 5BEC44D1, not started
MySQL thread id 108394257, OS thread handle 0x7f9a38d36700, query id 5384415004 10.13.33.20 root
---TRANSACTION 5BEC468B, not started
MySQL thread id 108394256, OS thread handle 0x7f9a3ad02700, query id 5384417058 10.13.33.20 root
---TRANSACTION 5BEC4696, not started
MySQL thread id 108394255, OS thread handle 0x7f9a3a4ea700, query id 5384417099 10.13.33.20 root
---TRANSACTION 5BEC3439, not started
MySQL thread id 108394254, OS thread handle 0x7f9a39754700, query id 5384394965 10.13.33.20 root
---TRANSACTION 5BE3C3B2, not started
MySQL thread id 108394242, OS thread handle 0x7f9a39426700, query id 5381837526 10.13.33.20 root
---TRANSACTION 5BEC4C50, not started
MySQL thread id 108394158, OS thread handle 0x7f9a3b07a700, query id 5384424099 10.13.33.20 root
---TRANSACTION 5BEC4E73, not started
MySQL thread id 108391224, OS thread handle 0x7f9a3aa68700, query id 5384426741 10.13.33.22 root
---TRANSACTION 5BEC3C6E, not started
MySQL thread id 108390046, OS thread handle 0x7f9a38a9c700, query id 5384404901 10.13.33.20 root
---TRANSACTION 5BEC3AD7, not started
MySQL thread id 108390031, OS thread handle 0x7f9a39bf4700, query id 5384403015 10.13.33.20 root
---TRANSACTION 5BE07D02, not started
MySQL thread id 108336566, OS thread handle 0x7f9a3a1bc700, query id 5384428791 10.13.33.20 root
show engine innodb status

END OF INNODB MONITOR OUTPUT

the 2 related tables are task_instance and dag_run, task_instance is a table that used to save the task run records, and dag_run keeps the dag records:

CREATE TABLE `task_instance` (
  `task_id` varchar(250) NOT NULL,
  `dag_id` varchar(250) NOT NULL,
  `execution_date` datetime(6) NOT NULL,
  `start_date` datetime(6) DEFAULT NULL,
  `end_date` datetime(6) DEFAULT NULL,
  `duration` float DEFAULT NULL,
  `state` varchar(20) DEFAULT NULL,
  `try_number` int(11) DEFAULT NULL,
  `hostname` varchar(1000) DEFAULT NULL,
  `unixname` varchar(1000) DEFAULT NULL,
  `job_id` int(11) DEFAULT NULL,
  `pool` varchar(50) DEFAULT NULL,
  `queue` varchar(50) DEFAULT NULL,
  `priority_weight` int(11) DEFAULT NULL,
  `operator` varchar(1000) DEFAULT NULL,
  `queued_dttm` datetime(6) DEFAULT NULL,
  `pid` int(11) DEFAULT NULL,
  `feedback` text,
  `run_type` tinyint(1) DEFAULT NULL,
  PRIMARY KEY (`task_id`,`dag_id`,`execution_date`),
  KEY `ti_dag_state` (`dag_id`,`state`),
  KEY `ti_pool` (`pool`,`state`,`priority_weight`),
  KEY `ti_state_lkp` (`dag_id`,`task_id`,`execution_date`,`state`),
  KEY `ti_state` (`state`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 


CREATE TABLE `dag_run` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `dag_id` varchar(250) DEFAULT NULL,
  `execution_date` datetime(6) DEFAULT NULL,
  `state` varchar(50) DEFAULT NULL,
  `run_id` varchar(250) DEFAULT NULL,
  `external_trigger` tinyint(1) DEFAULT NULL,
  `conf` blob,
  `end_date` datetime(6) DEFAULT NULL,
  `start_date` datetime(6) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `dag_id` (`dag_id`,`execution_date`),
  UNIQUE KEY `dag_id_2` (`dag_id`,`run_id`),
  KEY `dag_id_state` (`dag_id`,`state`)
) ENGINE=InnoDB AUTO_INCREMENT=53394 DEFAULT CHARSET=utf8

And when the DAG of Airflow becomes more and more, there are many deadlocks, which leads to a lot of retres in Airflow scheduler, I just wonder is there any ways to reduce the deadlock?

when run this sql:

explain select * from task_instance WHERE task_instance.dag_id = 'test_dag_01' AND task_instance.task_id = 'test_task_01_03' AND task_instance.execution_date = '2019-11-08 00:00:00';

it returns:

+------+-------------+---------------+-------+-----------------------------------+---------+---------+-------------------+------+-------+
| id   | select_type | table         | type  | possible_keys                     | key     | key_len | ref               | rows | Extra |
+------+-------------+---------------+-------+-----------------------------------+---------+---------+-------------------+------+-------+
|    1 | SIMPLE      | task_instance | const | PRIMARY,ti_dag_state,ti_state_lkp | PRIMARY | 1512    | const,const,const |    1 |       |
+------+-------------+---------------+-------+-----------------------------------+---------+---------+-------------------+------+-------+

Solution

Speeding up queries is a way to decrease the frequency of deadlocks. I see one more index that you need:

task_instance:  INDEX(dag_id, state)  -- got it
task_instance:  INDEX(dag_id, execution_date, state)  -- need
task_instance:  INDEX(dag_id, task_id, execution_date)  -- got it
dag_run:  INDEX(dag_id, execution_date)  -- got it

Almost all columns are NULLable. This seems unrealistic.

You have 3 unique keys on dag_run. If you don't need id, and if you can make dag_id NOT NULL and one of the other columns NOT NULL, then promote that to PRIMARY KEY.

Do you really need this much? varchar(250) -- If not, shrink them to a reasonable size. That will shrink the tables (and especially the indexes), thereby helping a little.

Having a better index also helps avoid unnecessarily locking more rows than needed.

Other queries in the transaction. What else is in each of the transactions? Perhaps a SELECT ... FOR UPDATE that locks something that we can't see from the output provided?

Deadlocks can occur when the same rows are locked in different orders by different transactions.

Normalize -- Another way to shrink tables is to replace strings with ints. Think about hostname, unixname, pool, queue, state, operator.