Tags: mysql, database-replication, percona, multi-master-replication, gtid

Deadlock on MTS replication


Situation:

We have a master-master replication setup using GTIDs on Percona MySQL 5.6.32-78.1. There are about 10 databases, and we have set slave_parallel_workers=5. One server is used for frontend handling and one for the backend. Two or three times a week, the replication on the backend server dies with this error:

2016-10-25 10:00:01 165238 [Warning] Slave SQL: Worker 4 failed executing transaction '0e7b97a8-a689-11e5-8b79-901b0e8b0f53:22506262' at master log mysql-bin.011888, end_log_pos 9306420; Could not execute Update_rows event on table shop.sessions; Deadlock found when trying to get lock; try restarting transaction, Error_code: 1213; handler error HA_ERR_LOCK_DEADLOCK; the event's master log mysql-bin.011888, end_log_pos 9306420, Error_code: 1213
2016-10-25 10:00:01 165238 [ERROR] Slave SQL: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
2016-10-25 10:00:01 165238 [Note] Error reading relay log event: slave SQL thread was killed

What could be the reason? There are no cross-database DML statements, and I thought that with MTS only one worker thread is used per database (the benefit of MTS being parallel replication across several databases). Why does replication break with a deadlock?
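
For reference, a minimal sketch of the applier configuration described above and of the manual recovery after such a stop, run from a mysql prompt. Only slave_parallel_workers=5 is taken from the description; everything else is generic:

-- MySQL 5.6 only parallelizes the applier across schemas;
-- a change to slave_parallel_workers takes effect after a slave restart.
STOP SLAVE;
SET GLOBAL slave_parallel_workers = 5;
START SLAVE;

-- After a worker aborts with errors 1213/1756, the applier is stopped.
-- Inspect the error and restart the SQL thread:
SHOW SLAVE STATUS\G
START SLAVE SQL_THREAD;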

EDIT 2016-10-28:

The schema of the table looks like this:

CREATE TABLE `sessions` (
  `id` int(11) NOT NULL,
  `session_id` char(40) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `crypt_iv` blob NOT NULL,
  `data` mediumblob NOT NULL,
  `user_id` int(11) NOT NULL,
  `last_refresh` datetime NOT NULL,
  `timeout` datetime NOT NULL,
  `closed` tinyint(4) NOT NULL,
  `inserted` datetime NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `sessions`
  ADD PRIMARY KEY (`id`),
  ADD UNIQUE KEY `session_id` (`session_id`),
  ADD KEY `user_id` (`user_id`),
  ADD KEY `timeout` (`timeout`);
ALTER TABLE `sessions` MODIFY `id` int(11) NOT NULL AUTO_INCREMENT;

So far this error has only happened on the backend side, never on the frontend server. At the moment I cannot paste the exact statement because the binary logs have already been purged, but the only statement inside the failing GTID transaction is a row-based UPDATE on this table.


Solution

  • I guess all sessions are created on the frontend server. Is there maybe a session cleanup job on the backend server? Then you have writes to the table from both machines. If you have a write-heavy table such as sessions, you should write it from one machine only to avoid this kind of deadlock (see the first sketch below).

    Actually, you should always do all writes on one machine only, except in failover cases when one master goes down (see the second sketch below).

    There are nice setups with HAProxy and health checks that handle the failover automatically and transparently for your clients.
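
    Purely as an illustration of how such a deadlock can arise (the real statements were purged with the binary logs, so both queries below are hypothetical): a bulk cleanup walking the timeout index on one machine, while the other machine refreshes individual sessions, can touch the same rows in a different order. The replicated row events from one machine can then deadlock against the local write running on the other machine, which matches the error 1213 the worker reports.

    -- Hypothetical cleanup job on the backend server:
    UPDATE sessions
       SET closed = 1
     WHERE timeout < NOW()
       AND closed = 0;

    -- Hypothetical per-request refresh on the frontend server:
    UPDATE sessions
       SET last_refresh = NOW(),
           timeout = NOW() + INTERVAL 30 MINUTE
     WHERE session_id = 'abc...';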
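
    A minimal sketch of how to enforce the single-writer rule on the passive master, assuming your applications connect with non-SUPER accounts (replication threads are not affected by read_only, so the replication stream keeps flowing):

    -- On the machine that should not take client writes:
    SET GLOBAL read_only = ON;

    -- On failover, promote the other machine to writable:
    SET GLOBAL read_only = OFF;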