Search code examples
mysqldeadlock

Need help understanding where MySQL deadlocks


We have a long standing problem with a certain deadlock in our DB. The conflicting queries are almost always these two:

UPDATE db.tbl 
    SET pid = 68111
WHERE pid IS NULL 
AND processed IS NULL 
AND `time` <= now() 
LIMIT 10

and

UPDATE `db`.`tbl` 
    SET `processed` = -1,
        `updated_on` = NOW() 
WHERE (`tbl`.`id` = 108588129)

The pid, id and processed values vary, but the queries are the same.

What I don't understand is what exactly could deadlock there, since the second query updates a specific record by its ID. There's just one lock to be had, but you need at least 2 to create a deadlock. There are also no transactions that would lock other rows - each of these queries is standalone.

Why is this happening and how could I avoid it?

Anonymized/shortened output of show engine status innodb;:

------------------------
LATEST DETECTED DEADLOCK
------------------------
2023-06-08 08:16:27 0x7f70f3ef1700
*** (1) TRANSACTION:
TRANSACTION 2857804352, ACTIVE 0 sec fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 2480 lock struct(s), heap size 286928, 9581 row lock(s)
MySQL thread id 25966272, OS thread handle 140144661681920, query id 847014117 x.x.x.x db_user updating
UPDATE db.tbl SET pid = ''68111'' WHERE pid IS NULL AND processed IS NULL AND `time` <= now() LIMIT 10

*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 2714 page no 86552 n bits 1552 index processed of table `db`.`tbl` trx id 2857804352 lock_mode X
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: SQL NULL;
 1: len 4; hex 8678cf9b; asc  x  ;;

Record lock, heap no 3 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: SQL NULL;
 1: len 4; hex 8678cfad; asc  x  ;;

<Snip a lot of Record Locks just like the one above>

*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2714 page no 113955 n bits 160 index PRIMARY of table `db`.`tbl` trx id 2857804352 lock_mode X locks rec but not gap waiting
Record lock, heap no 92 PHYSICAL RECORD: n_fields 23; compact format; info bits 0
 0: len 4; hex 8678ec61; asc  x a;;
 1: len 6; hex 0000aa56a25d; asc    V ];;
 2: len 7; hex 0100002bc01686; asc    +   ;;
 3: len 4; hex 803ff583; asc  ?  ;;
 4: len 4; hex 73746f70; asc stop;;
 5: len 4; hex 80000002; asc     ;;
 6: len 1; hex 83; asc  ;;
 7: len 4; hex 80005737; asc   W7;;
 8: len 6; hex 4b4a38363431; asc KJ8641;;
 9: len 8; hex 80000008a515e59b; asc         ;;
 10: len 4; hex 800001be; asc     ;;
 11: len 5; hex 99b050b41a; asc   P  ;;
 12: len 4; hex 84ad8892; asc     ;;
 13: len 3; hex 736d73; asc sms;;
 14: SQL NULL;
 15: SQL NULL;
 16: len 1; hex 7f; asc  ;;
 17: SQL NULL;
 18: SQL NULL;
 19: SQL NULL;
 20: len 4; hex 53746f70; asc Stop;;
 21: len 5; hex 99b050b41b; asc   P  ;;
 22: len 5; hex 99b050b41b; asc   P  ;;


*** (2) TRANSACTION:
TRANSACTION 2857804381, ACTIVE 0 sec updating or deleting
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1
MySQL thread id 25966282, OS thread handle 140122483259136, query id 847014368 x.x.x.x other_db_user updating
UPDATE `db`.`tbl` SET `processed` = ''-1'', `updated_on` = NOW() WHERE (`tbl`.`id` = 108588129)

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 2714 page no 113955 n bits 160 index PRIMARY of table `db`.`tbl` trx id 2857804381 lock_mode X locks rec but not gap
Record lock, heap no 92 PHYSICAL RECORD: n_fields 23; compact format; info bits 0
 0: len 4; hex 8678ec61; asc  x a;;
 1: len 6; hex 0000aa56a25d; asc    V ];;
 2: len 7; hex 0100002bc01686; asc    +   ;;
 3: len 4; hex 803ff583; asc  ?  ;;
 4: len 4; hex 73746f70; asc stop;;
 5: len 4; hex 80000002; asc     ;;
 6: len 1; hex 83; asc  ;;
 7: len 4; hex 80005737; asc   W7;;
 8: len 6; hex 4b4a38363431; asc KJ8641;;
 9: len 8; hex 80000008a515e59b; asc         ;;
 10: len 4; hex 800001be; asc     ;;
 11: len 5; hex 99b050b41a; asc   P  ;;
 12: len 4; hex 84ad8892; asc     ;;
 13: len 3; hex 736d73; asc sms;;
 14: SQL NULL;
 15: SQL NULL;
 16: len 1; hex 7f; asc  ;;
 17: SQL NULL;
 18: SQL NULL;
 19: SQL NULL;
 20: len 4; hex 53746f70; asc Stop;;
 21: len 5; hex 99b050b41b; asc   P  ;;
 22: len 5; hex 99b050b41b; asc   P  ;;


*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2714 page no 86552 n bits 1552 index processed of table `db`.`tbl` trx id 2857804381 lock_mode X locks rec but not gap waiting
Record lock, heap no 1462 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: SQL NULL;
 1: len 4; hex 8678ec61; asc  x a;;

*** WE ROLL BACK TRANSACTION (2)

Also, a simplified table definition:

CREATE TABLE `tbl` (
  `id` int NOT NULL AUTO_INCREMENT,
  `pid` int DEFAULT NULL,
  `processed` tinyint DEFAULT NULL,
  `time` datetime DEFAULT NULL,
  `created_on` datetime DEFAULT NULL,
  `updated_on` datetime DEFAULT NULL,

  `other` varchar(10) NOT NULL,
  
  PRIMARY KEY (`id`),
  KEY `ix_other` (`other`),
  KEY `ix_processed` (`processed`),
  KEY `ix_time` (`time`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

I should also note that there are many rows (currently over 5000, but the number constantly varies during the day) with processed=null, pid=null but time in the future.

Added: A bit more context, as per comments:

There are currently a little over 5 million rows in this table. It's a sort of a queue and we get roughly 1.4M rows per month. The earliest rows are in February. When time is in the future, it's almost always in the same day.

pid and processed are related to the queue. When a row (job) is inserted, both pid and processed are null. time is when the job should be performed (some jobs need to be performed only AFTER a specified moment in time; others need to be performed ASAP).

When one of the processors is ready to execute some jobs, it starts by running the first query and "reserving" a batch of 10 jobs for itself by settings its PID (Linux process ID) to those rows. It then selects those rows and works through them one-by-one, setting the processed flag when it's done (the second query).

However sometimes those two queries clash and we get a deadlock.


Solution

  • I would suggest updating your update

    UPDATE db.tbl 
        SET pid = 68111
    WHERE pid IS NULL 
    AND processed IS NULL 
    AND `time` <= now() 
    LIMIT 10
    

    To an update where ID IN (select). At least my thinking about it. The update is locking while it is trying to gather and figure out what it SHOULD update. By pre-querying what you want as a select, and updating based on those IDs, the Select does not lock, finishes getting the few records it needs, then applies the update to them by their key ID.

    So the time index is used first for the select (should be fast), then the update is done based on the PKID returned from the select. Something like

    UPDATE db.tbl 
       SET pid = 68111
       WHERE id in ( select t2.id
                        from db.tbl t2
                        where t2.time <= now()
                          and t2.pid is null
                          and t2.processed is null 
                        limit 10 )    
    

    Now to optimize the Select query, I would have a single index to include the parts it needs without having to go to the raw data pages. So instead of your index on just the TIME column, I would have an index on

    (time, pid, processed, id )
    

    So the entire where is qualified, but also has the ID for returning the outer update WHERE ID IN condition.