Search code examples
mysqlsqlgroup-bycross-join

MySQL numbering records by group - did I hit a bug?


I am trying to number some records in MySQL (5.5.44-0 on Ubuntu), grouped by another column (you will see what I mean below). I am adapting the solution described at Running Sums for Multiple Categories in MySQL, except I'm just numbering, not summing.

The tables involved are quite huge with almost 100 columns, so let's simplify the demonstration first by creating derived tables with only the important columns involved. Apologies for not sharing a SQL Fiddle, because it doesn't look like it's replicable unless done with big volume of data, which I cannot share:

Creating the tables:

CREATE TABLE `inquiries_test` (
  `id` int(11) NOT NULL DEFAULT '0',
  `motive` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
  PRIMARY KEY (`inquiry_id`),
  KEY `motive` (`motive`)
);

insert into inquires_test select id, motive from inquiries;

CREATE TABLE `leads_test` (
  `lead_id` int(11) DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  `inquiry_id` int(11) DEFAULT NULL,
  KEY `id` (`lead_id`)
);

insert into leads_test select lead_id, created_at, inquiry_id;

CREATE TABLE `lead_inserts` (
  `lead_id` int(11) DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  `cnt` int(11) DEFAULT NULL
);

You will notice above that data from inquiries_test and leads_test came from the actual production tables. The importance of that will come to play later. Now populating lead_inserts:

playground>insert into lead_inserts (cnt, created_at, lead_id) 
    -> SELECT @cnt := if(@id = l.lead_id,@cnt,0) + 1 as cnt 
    -> , l.created_at 
    -> , @id := l.lead_id as local_resouce_id
    -> FROM leads_test l join inquiries_test i on (l.inquiry_id=i.id)
    -> CROSS JOIN (select @id := 0, @cnt := 0) as InitVarsAlias 
    -> where i.motive='real' ORDER BY lead_id, created_at;
Query OK, 2172774 rows affected (14.30 sec)
Records: 2172774  Duplicates: 0  Warnings: 0

playground>select * from lead_inserts where lead_id in (117,118);
+---------+---------------------+------+
| lead_id | created_at          | cnt  |
+---------+---------------------+------+
|     117 | 2012-06-23 00:13:09 |    1 |
|     117 | 2014-09-14 04:30:37 |    2 |
|     117 | 2015-01-27 22:34:41 |    3 |
|     117 | 2015-03-19 19:33:51 |    4 |
|     118 | 2014-12-24 17:47:15 |    1 |
|     118 | 2015-01-23 21:30:09 |    2 |
|     118 | 2015-04-07 21:33:43 |    3 |
|     118 | 2015-04-10 17:00:04 |    4 |
|     118 | 2015-05-12 21:59:49 |    5 |
+---------+---------------------+------+

So far so good - the value of cnt "resets" for every new lead_id. Now given that leads_test and inquiries_tests are basically leads and inquiries with the other columns removed, it makes sense to expect that if I modify the insert statement to use the original tables, the result should be the same, right? But look:

playground>truncate table lead_inserts;
Query OK, 0 rows affected (0.14 sec)

playground>insert into lead_inserts (cnt, created_at, lead_id) 
    -> SELECT @cnt := if(@id = l.lead_id,@cnt,0) + 1 as cnt 
    -> , l.created_at 
    -> , @id := l.lead_id as local_resouce_id
    -> FROM leads l join inquiries i on (l.inquiry_id=i.id)        
    -> CROSS JOIN (select @id := 0, @cnt := 0) as InitVarsAlias 
    -> where i.motive='real' ORDER BY lead_id, created_at;
Query OK, 2172774 rows affected (17.25 sec)
Records: 2172774  Duplicates: 0  Warnings: 0

playground>select * from lead_inserts where lead_id in (117,118);
+---------+---------------------+------+
| lead_id | created_at          | cnt  |
+---------+---------------------+------+
|     117 | 2012-06-23 00:13:09 |    1 |
|     117 | 2014-09-14 04:30:37 |    1 |
|     117 | 2015-01-27 22:34:41 |    1 |
|     117 | 2015-03-19 19:33:51 |    1 |
|     118 | 2014-12-24 17:47:15 |    1 |
|     118 | 2015-01-23 21:30:09 |    1 |
|     118 | 2015-04-07 21:33:43 |    1 |
|     118 | 2015-04-10 17:00:04 |    1 |
|     118 | 2015-05-12 21:59:49 |    1 |
+---------+---------------------+------+

What happened to the numbering? Other observations while using the original tables:

  1. If I do not process all records and specify only a few lead_id's, the computation comes out right.
  2. If I remove the INSERT clause and just run it as a select (with a LIMIT clause to just show 50 rows output), the computation comes out right.

So, is this a bug I hit, or am I missing something? In real life I cannot use the process above as a workaround - I really have to use leads and inquiries because there are other columns from those tables that have to be part of lead_inserts.

Thanks!


Solution

  • A Cha pointed out, it looks like it's a MySQL optimization thing where MySQL finds no reason to do an ORDER BY when the end result will just be inserted to a new table. Why it works for the test tables and not for the production ones, when they have the same number of rows, I don't know. But this is how I forced it to sort what will be inserted:

    First make sure that there is a concatenated index for the columns I will order by:

    CREATE INDEX idx_leads_lead_id_created ON leads(lead_id, created_at);
    

    Then force MySQL to use this index:

    insert into lead_inserts (cnt, created_at, lead_id) 
    SELECT @cnt := if(@id = l.lead_id,@cnt,0) + 1 as cnt 
    , l.created_at 
    @id := l.lead_id as local_resouce_id
    FROM leads l FORCE INDEX FOR ORDER BY (idx_leads_lead_id_created)
    JOIN inquiries i on (l.inquiry_id=i.id)        
    CROSS JOIN (select @id := 0, @cnt := 0) as InitVarsAlias 
    WHERE i.motive='real' 
    ORDER BY lead_id, created_at;