Tags: mysql, query-optimization

How do I optimize a DELETE query to remove duplicates without creating an index?


I am running a query to delete duplicates (user defined) from a table that has ~3M records. The query is:

DELETE t1
FROM       `path_alias_revision` t1
INNER JOIN `path_alias_revision` t2
WHERE t1.id < t2.id AND  t1.path=t2.path AND binary(t1.alias) = binary(t2.alias)

Output of SHOW CREATE TABLE:

CREATE TABLE `path_alias_revision` (
  `id` int(10) unsigned NOT NULL,
  `revision_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `langcode` varchar(12) CHARACTER SET ascii NOT NULL,
  `path` varchar(255) DEFAULT NULL,
  `alias` varchar(255) DEFAULT NULL,
  `status` tinyint(4) NOT NULL,
  `revision_default` tinyint(4) DEFAULT NULL,
  PRIMARY KEY (`revision_id`),
  KEY `path_alias__id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=80213807 DEFAULT CHARSET=utf8mb4 COMMENT='The revision table for path_alias entities.'

The EXPLAIN output:

explain DELETE t1 FROM  path_alias_revision t1 INNER JOIN path_alias_revision t2 WHERE t1.id < t2.id AND  t1.path=t2.path AND binary(t1.alias) = binary(t2.alias);
+----+-------------+-------+------------+------+----------------+------+---------+------+---------+----------+------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys  | key  | key_len | ref  | rows    | filtered | Extra                                          |
+----+-------------+-------+------------+------+----------------+------+---------+------+---------+----------+------------------------------------------------+
|  1 | DELETE      | t1    | NULL       | ALL  | path_alias__id | NULL | NULL    | NULL | 3105455 |   100.00 | NULL                                           |
|  1 | SIMPLE      | t2    | NULL       | ALL  | path_alias__id | NULL | NULL    | NULL | 3105455 |     3.33 | Range checked for each record (index map: 0x2) |
+----+-------------+-------+------------+------+----------------+------+---------+------+---------+----------+------------------------------------------------+

I cannot tell if the query is hanging or just taking a long time. The output of SHOW PROCESSLIST is:


MySQL [acquia]> show processlist \G;
*************************** 1. row ***************************
     Id: 11
   User: acquia
   Host: 172.18.0.3:37498
     db: acquia
Command: Query
   Time: 602
  State: Sending data
   Info: DELETE t1
FROM       path_alias_revision t1
INNER JOIN path_alias_revision t2
WHERE      t1.id < t2.
*************************** 2. row ***************************
     Id: 15
   User: acquia
   Host: 172.18.0.3:37512
     db: acquia
Command: Query
   Time: 0
  State: starting
   Info: show processlist
2 rows in set (0.000 sec)

ERROR: No query specified

What can I do to improve this query? I know I could just move the data I want to retain into a temp table and rename it, but I want to understand what's going on here. I have increased some MySQL settings to:

max_allowed_packet = 128M
innodb_buffer_pool_chunk_size = 128M
innodb_buffer_pool_size = 8G

But it's not helping.
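
For reference, the copy-and-rename approach I'm referring to would look roughly like this. It's only a sketch: the _new/_old table names are placeholders, and as a simplification it keeps the row with the largest revision_id per (path, binary alias) group rather than my id-based rule.

-- Build a replacement table holding only the rows to keep, then swap names.
CREATE TABLE path_alias_revision_new LIKE path_alias_revision;

INSERT INTO path_alias_revision_new
SELECT t.*
FROM   path_alias_revision t
JOIN  (SELECT MAX(revision_id) AS revision_id
         FROM path_alias_revision
        GROUP BY path, binary(alias)) keep_rows
  ON   t.revision_id = keep_rows.revision_id;

RENAME TABLE path_alias_revision     TO path_alias_revision_old,
             path_alias_revision_new TO path_alias_revision;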


Solution

  • Here's another way to solve this problem if you don't care which of your duplicates you want to keep.

    First, let's do a subquery to retrieve the PK values of the rows you actually want to keep.

                     SELECT MAX(revision_id) revision_id
                       FROM path_alias_revision
                      GROUP BY path, binary(alias)
    

    This chooses just one revision_id from each group of duplicate rows: the largest one. It generates a monster sort operation and will take a while, but the sort is inherently O(n log n) rather than O(n²) in complexity.

    Then you can use that subquery to run your deletes, like this:

    DELETE FROM path_alias_revision
      WHERE revision_id NOT IN (
              SELECT revision_id FROM (
                     SELECT MAX(revision_id) revision_id
                       FROM path_alias_revision
                      GROUP BY path, binary(alias)) keep_rows);

    Because revision_id is the primary key, the NOT IN() predicate has access to an existing index. (The extra SELECT revision_id FROM (...) keep_rows wrapper is needed because MySQL will not let a DELETE select directly from the table it is deleting from; materializing the subquery as a derived table works around that error.)

    If the number of rows to be deleted is modest, you can probably do this all at once. If it's vast, you'll need to chunk the deletion to avoid a too-large transaction.
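
    A sketch of one chunked variant (the LIMIT of 10000 is an arbitrary batch size; rerun the statement until it reports 0 rows affected):

    -- Delete at most 10000 duplicate rows per run; repeat until nothing is deleted.
    DELETE FROM path_alias_revision
      WHERE revision_id NOT IN (
              SELECT revision_id FROM (
                     SELECT MAX(revision_id) revision_id
                       FROM path_alias_revision
                      GROUP BY path, binary(alias)) keep_rows)
      LIMIT 10000;

    Each batch recomputes the keeper list, so this trades repeated sorts for smaller, safer transactions.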