Search code examples
mysqlgroup-byinner-joininnodb

Efficient way to compute number of matchings between two columns in MySQL


Description

I have a MySQL table like the following one:

CREATE TABLE `ticket` (
  `ticket_id` int(11) NOT NULL AUTO_INCREMENT,
  `ticket_number` varchar(30) DEFAULT NULL,
  `pick1` varchar(2) DEFAULT NULL,
  `pick2` varchar(2) DEFAULT NULL,
  `pick3` varchar(2) DEFAULT NULL,
  `pick4` varchar(2) DEFAULT NULL,
  `pick5` varchar(2) DEFAULT NULL,
  `pick6` varchar(2) DEFAULT NULL,
  PRIMARY KEY (`ticket_id`)
) ENGINE=InnoDB AUTO_INCREMENT=19675 DEFAULT CHARSET=latin1;

Let's also asume we have the following values already stored in DB:

+-----------+-------------------+-------+-------+-------+-------+-------+-------+
| ticket_id |   ticket_number   | pick1 | pick2 | pick3 | pick4 | pick5 | pick6 |
+-----------+-------------------+-------+-------+-------+-------+-------+-------+
| 655       | 08-09-21-24-46-52 | 8     | 9     | 21    | 24    | 46    | 52    |
| 658       | 08-23-24-40-42-45 | 8     | 23    | 24    | 40    | 42    | 45    |
| 660       | 07-18-19-20-22-31 | 7     | 18    | 19    | 20    | 22    | 45    |
| ...       | ...               | ...   | ...   | ...   | ...   | ...   | ...   |
| 19674     | 06-18-33-43-49-50 | 6     | 18    | 33    | 43    | 49    | 50    |
+-----------+-------------------+-------+-------+-------+-------+-------+-------+

Now, my goal is to compare each ticket with each other one in the Table (except itself), in terms of their respective values in ticket_number field (6 elements per set, split by -). Put differently, for instance, imagine I compare ticket_id = 655 with ticket_id = 658, in terms of the elements in their respectives ticket_number fields, then I will find that elements 08 and 24 appear in both sets. If we now compare ticket_id = 660 with ticket_id = 19674, then we have that there is only one coincidence: 18.

What I am actually using to carry out these comparisons is the following query:

select A.ticket_id, A.ticket_number, P.ticket_id, P.ticket_number, count(P.ticket_number) as cnt from ticket A inner join ticket P on A.ticket_id != P.ticket_id
where 
((A.ticket_number like concat("%", lpad(P.pick1,2,0), "%"))
+ (A.ticket_number like concat("%", lpad(P.pick2,2,0), "%"))
+ (A.ticket_number like concat("%", lpad(P.pick3,2,0), "%"))
+ (A.ticket_number like concat("%", lpad(P.pick4,2,0), "%"))
+ (A.ticket_number like concat("%", lpad(P.pick5,2,0), "%"))
+ (A.ticket_number like concat("%", lpad(P.pick6,2,0), "%")) > 3) group by A.ticket_id
having cnt > 5;

That is, first I create a INNER JOIN concatenating all rows with different ticket_id and then I compare each P.pickX (X=[1..6]) with the A.ticket_number of the resulting INNER JOIN operation, and I count the number of matchings between both sets.

Finally, after executing, I obtain something like this:

+-------------+-------------------+-------------+-------------------+-----+
| A.ticket_id |  A.ticket_number  | P.ticket_id |  P.ticket_number  | cnt |
+-------------+-------------------+-------------+-------------------+-----+
| 8489        | 14-21-28-32-48-49 | 2528        | 14-21-33-45-48-49 | 6   |
| 8553        | 02-14-17-38-47-53 | 2364        | 02-30-38-44-47-53 | 6   |
| 8615        | 05-12-29-33-36-43 | 4654        | 12-21-29-33-36-37 | 6   |
| 8686        | 09-13-29-34-44-48 | 6038        | 09-13-17-29-33-44 | 6   |
| 8693        | 01-10-14-17-42-50 | 5330        | 01-10-37-42-48-50 | 6   |
| ...         | ...               | ...         | ...               | ... |
| 19195       | 05-13-29-41-46-51 | 5106        | 07-13-14-29-41-51 | 6   |
+-------------+-------------------+-------------+-------------------+-----+ 

Problem

The problem is that I execute this for a table of 10476 rows, resulting in more tan 100 Million ticket_number vs pickX to compare, lasting around 172 seconds in total to conclude. This is too slow.

GOAL

My goal is to make this execution as fast as possible so as to be completed in less than a second, since this must work in real-time.

Is that possible?


Solution

  • If you want to keep the current structure then change pick1..6 to tinyint type instead of varchar

    TINYINT(1) stores the values between -128 to 128 if it is signed. And then your query won't have that concat with % which is the cause of slow run.

    Then, these two queries will give you the same result

    select * FROM ticket where pick1 = '8';
    select * FROM ticket where pick1 = '08';
    

    This is the sql structure:

    CREATE TABLE `ticket` (
      `ticket_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
      `ticket_number` varchar(30) DEFAULT NULL,
      `pick1` tinyint(1) unsigned zerofill DEFAULT NULL,
      `pick2` tinyint(1) unsigned zerofill DEFAULT NULL,
      `pick3` tinyint(1) unsigned zerofill DEFAULT NULL,
      `pick4` tinyint(1) unsigned zerofill DEFAULT NULL,
      `pick5` tinyint(1) unsigned zerofill DEFAULT NULL,
      `pick6` tinyint(1) unsigned zerofill DEFAULT NULL,
        PRIMARY KEY (`ticket_id`)
      ) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=latin1;
    

    I think, you even can remove the zerofill

    if this doesn't work, change the table design.