Search code examples
mysqlquery-optimization

SQL query with multiple joins to the same table


I have a query (simplified to show the problem, but remove 7 more similar joins):

SELECT 
    `s`.`id`, `s`.`mobile_number`, MAX(`s`.`row_number`), `s`.`campaign_name`, `s`.`createdate`, `s`.`moddate`, 
    `se1`.`column_value` AS `first_name`, 
    `se2`.`column_value` AS `last_name`
FROM `kcms_shopper` `s`
LEFT JOIN `kcms_shopper_extend` `se1` 
    ON `s`.`mobile_number` = `se1`.`mobile_number` 
    AND `s`.`campaign_name` = `se1`.`campaign_name`
    AND `s`.`row_number` = `se1`.`row_number`
LEFT JOIN `kcms_shopper_extend` `se2` 
    ON `s`.`mobile_number` = `se2`.`mobile_number` 
    AND `s`.`campaign_name` = `se2`.`campaign_name`
    AND `s`.`row_number` = `se1`.`row_number`
WHERE `s`.`row_number` = (
    SELECT MAX(`row_number`) 
    FROM `kcms_shopper_extend` sx 
    WHERE `s`.`mobile_number` = `sx`.`mobile_number`
    AND `s`.`campaign_name` = `sx`.`campaign_name`
)
AND `se1`.`column_name` = "first_name"
AND `se2`.`column_name` = "last_name"
GROUP BY `s`.`mobile_number`, `s`.`row_number`
ORDER BY `s`.`mobile_number` ASC

Where the goal is to get data from table shopper and joining multiple times on shopper_extend.

Each shopper can have multiple rows (if they entered a campaign more than once using their mobile number) and each campaign can have a custom configured set of columns captured per campaign, hence the join table.

The structure of shopper is as follows:

CREATE TABLE `kcms_shopper` (
    `id` int(11) NOT NULL,
    `mobile_number` varchar(16) NOT NULL,
    `campaign_name` varchar(64) NOT NULL,
    `row_number` int(11) NOT NULL,
    `createdate` datetime NOT NULL DEFAULT current_timestamp(),
    `moddate` datetime NOT NULL DEFAULT current_timestamp() ON UPDATE current_timestamp()
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

ALTER TABLE `kcms_shopper`
    ADD PRIMARY KEY (`id`),
    ADD KEY `ix__mobile_number` (`mobile_number`) USING BTREE,
    ADD KEY `ix__campaign_name` (`campaign_name`);

ALTER TABLE `kcms_shopper`
    MODIFY `id` int(11) NOT NULL AUTO_INCREMENT;

The structure of shopper_extend is as follows:

CREATE TABLE `kcms_shopper_extend` (
    `id` int(11) NOT NULL,
    `shopper_id` int(11) NOT NULL,
    `mobile_number` varchar(16) NOT NULL,
    `campaign_name` varchar(64) NOT NULL,
    `row_number` int(11) NOT NULL,
    `column_name` varchar(64) NOT NULL,
    `column_value` varchar(4096) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

ALTER TABLE `kcms_shopper_extend`
    ADD PRIMARY KEY (`id`),
    ADD KEY `ix__column_name` (`column_name`) USING BTREE,
    ADD KEY `ix__mobile_number` (`mobile_number`),
    ADD KEY `ix__campaign_name` (`campaign_name`);

ALTER TABLE `kcms_shopper_extend`
    MODIFY `id` int(11) NOT NULL AUTO_INCREMENT;    

Please assist me in retrieving:

1. The last entry of a user (row_number)
2. For a specific campaign
3. Using a specific mobile number

The above query does not result in an error (yet), but I believe it to be wrong, because it does not complete. It hangs my MySQL for at least 10 minutes (by the time of completing this question, the query has not yet completed.


Solution

  • I see you have no indexes defined in your tables. (Edit: since your initial post of this question, you updated it with indexes.)

    Indexes are important for optimizing joins. You can verify whether a query uses indexes by analyzing the query with EXPLAIN.

    I tested your query with EXPLAIN and found it was doing joins "the hard way" which is indicated by "Using join buffer (hash join)". (Edit: I first tested the query against tables with no indexes, as they were initially described.)

    Also the subquery is a "Dependent Subquery" which means it must be executed many times, once for each distinct value it is compared to in the outer query. This is very expensive for performance.

    +----+--------------------+-------+------------+------+---------------+------+---------+------+------+----------+----------------------------------------------+
    | id | select_type        | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                                        |
    +----+--------------------+-------+------------+------+---------------+------+---------+------+------+----------+----------------------------------------------+
    |  1 | PRIMARY            | se1   | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    4 |    25.00 | Using where; Using temporary; Using filesort |
    |  1 | PRIMARY            | s     | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    3 |    33.33 | Using where; Using join buffer (hash join)   |
    |  1 | PRIMARY            | se2   | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    4 |    25.00 | Using where; Using join buffer (hash join)   |
    |  2 | DEPENDENT SUBQUERY | sx    | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    4 |    25.00 | Using where                                  |
    +----+--------------------+-------+------------+------+---------------+------+---------+------+------+----------+----------------------------------------------+
    

    I then created some indexes I thought would help:

    ALTER TABLE kcms_shopper
        ADD PRIMARY KEY (id),
        ADD INDEX bk1 (`row_number`),
        ADD INDEX bk2 (`mobile_number`, `campaign_name`, `row_number`);
    
    ALTER TABLE kcms_shopper_extend
        ADD PRIMARY KEY (id),
        ADD INDEX bk3 (mobile_number, campaign_name, `row_number`, column_name);
    

    And I refactored your subquery into another OUTER JOIN. I've used this method to implement the greatest-row-per-group type of query pattern. This allows the join to be optimized with indexes, just like any other join.

    SELECT
        `s`.`id`,
        `s`.`mobile_number`,
        MAX(`s`.`row_number`),
        `s`.`campaign_name`,
        `s`.`createdate`,
        `s`.`moddate`,
        `se1`.`column_value` AS `first_name`,
        `se2`.`column_value` AS `last_name`
    FROM `kcms_shopper` `s`
    LEFT JOIN `kcms_shopper_extend` `se1`
        ON `s`.`mobile_number` = `se1`.`mobile_number`
        AND `s`.`campaign_name` = `se1`.`campaign_name`
        AND `s`.`row_number` = `se1`.`row_number`
    LEFT JOIN `kcms_shopper_extend` `se2`
        ON `s`.`mobile_number` = `se2`.`mobile_number`
        AND `s`.`campaign_name` = `se2`.`campaign_name`
        AND `s`.`row_number` = `se2`.`row_number`
    LEFT JOIN `kcms_shopper` `sx`
        ON `s`.`mobile_number` = `sx`.`mobile_number`
        AND `s`.`campaign_name` = `sx`.`campaign_name`
        AND `s`.`row_number` < `sx`.`row_number`
    WHERE `sx`.`row_number` IS NULL
    AND `se1`.`column_name` = "first_name"
    AND `se2`.`column_name` = "last_name"
    GROUP BY `s`.`mobile_number`, `s`.`row_number`
    ORDER BY `s`.`mobile_number` ASC;
    

    The EXPLAIN analysis shows that it uses indexes for all the joins. It still has to do a table-scan for the first table, but the others are all resolved with indexed lookups (indicated by type: ref in the EXPLAIN report).

    +----+-------------+-------+------------+------+---------------+------+---------+-----------------------------------------------+------+----------+----------------------------------------------+
    | id | select_type | table | partitions | type | possible_keys | key  | key_len | ref                                           | rows | filtered | Extra                                        |
    +----+-------------+-------+------------+------+---------------+------+---------+-----------------------------------------------+------+----------+----------------------------------------------+
    |  1 | SIMPLE      | se1   | NULL       | ALL  | bk3           | NULL | NULL    | NULL                                          |    4 |    25.00 | Using where; Using temporary; Using filesort |
    |  1 | SIMPLE      | s     | NULL       | ref  | bk1,bk2       | bk1  | 4       | test.se1.row_number                           |    1 |    25.00 | Using where                                  |
    |  1 | SIMPLE      | se2   | NULL       | ref  | bk3           | bk3  | 324     | test.se1.mobile_number,test.se1.campaign_name |    1 |    25.00 | Using index condition                        |
    |  1 | SIMPLE      | sx    | NULL       | ref  | bk1,bk2       | bk2  | 66      | test.se1.mobile_number                        |    1 |    25.00 | Using where; Not exists; Using index         |
    +----+-------------+-------+------------+------+---------------+------+---------+-----------------------------------------------+------+----------+----------------------------------------------+
    

    You might like my presentation How to Design Indexes, Really, or the video of me presenting it.

    I also wrote a chapter on indexing in my book SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.