Search code examples
mariadbcorrelated-subquery

MariadB: using correlated subquery gives only one row


I have two tables c0 and m0 which I merge using a left join. The result is a table with 13 rows. I would like to add the first date of the next 40 days windows. Here are the data:

DROP TABLE IF EXISTS test0.c0;
CREATE TABLE test0.c0 (
  id      INTEGER
, cons_dt DATE
);

INSERT INTO test0.c0 VALUES
 ('1','2000-01-01')
,('1','2000-02-01')
,('1','2000-03-01')
,('1','2000-04-01')
,('1','2000-05-01')
,('1','2000-06-01')
,('1','2000-07-01')
,('1','2000-08-01')
,('1','2000-09-01')
,('1','2000-10-01')
,('1','2000-11-01')
,('1','2000-12-01')
;

DROP TABLE IF EXISTS test0.m0;
CREATE TABLE test0.m0 ( 
  id        INTEGER
, start_dt  DATE
, atc       CHAR(1)
);

INSERT INTO test0.m0 VALUES
 ('1','2000-03-01','A')
,('1','2000-04-01','A')
,('1','2000-08-01','A')
,('1','2000-08-01','B')
,('1','2000-09-01','A')
,('1','2000-10-01','B')
;

And here is the code:

SELECT c.*, m.start_dt, m.atc
      , (SELECT MIN(c.cons_dt)
         FROM test0.c0 c2
         WHERE c2.id = c.id 
                   AND c2.cons_dt BETWEEN c.cons_dt + INTERVAL 1 DAY AND c.cons_dt + INTERVAL 40 DAY
         ) AS stop_dt
FROM test0.c0 c
LEFT OUTER JOIN test0.m0 m ON c.id = m.id AND c.cons_dt = m.start_dt

This is the output:

+----+------------+------------+-----+---------+
| id |  cons_dt   |  start_dt  | atc | stop_dt |
+----+------------+------------+-----+---------+
|  1 | 2000-03-01 | 2000-03-01 | A   | \N      |
+----+------------+------------+-----+---------+

I expected to get 13 rows with a stop date or without if the condition is not fullfilled. What's wrong with the code? I know I could use windows function but this works not for this kind of data. Consider that this is a simplified dataset. The original has many ids and various atcs.

UPDATE 1: Here are present a code which gives the correct result:

SELECT c.*, m.start_dt, m.atc, MIN(c2.cons_dt) AS stop_dt
FROM test0.c0 c
LEFT OUTER JOIN test0.m0 m ON c.id = m.id AND c.cons_dt = m.start_dt
LEFT JOIN test0.c0 c2 ON c.id = c2.id
                      AND c2.cons_dt BETWEEN c.cons_dt + INTERVAL 1 DAY AND c.cons_dt + INTERVAL 30 DAY
GROUP BY c.id, c.cons_dt, m.start_dt, m.atc
;

and here is the table I get and I expected:

+----+------------+------------+-----+------------+
| id |  cons_dt   |  start_dt  | atc |  stop_dt   |
+----+------------+------------+-----+------------+
|  1 | 2000-01-01 | \N         | \N  | \N         |
|  1 | 2000-02-01 | \N         | \N  | 2000-03-01 |
|  1 | 2000-03-01 | 2000-03-01 | A   | \N         |
|  1 | 2000-04-01 | 2000-04-01 | A   | 2000-05-01 |
|  1 | 2000-05-01 | \N         | \N  | \N         |
|  1 | 2000-06-01 | \N         | \N  | 2000-07-01 |
|  1 | 2000-07-01 | \N         | \N  | \N         |
|  1 | 2000-08-01 | 2000-08-01 | A   | \N         |
|  1 | 2000-08-01 | 2000-08-01 | B   | \N         |
|  1 | 2000-09-01 | 2000-09-01 | A   | 2000-10-01 |
|  1 | 2000-10-01 | 2000-10-01 | B   | \N         |
|  1 | 2000-11-01 | \N         | \N  | 2000-12-01 |
|  1 | 2000-12-01 | \N         | \N  | \N         |
+----+------------+------------+-----+------------+

I just want to understand why the subquery does not work. An interesting question would be also which performs better.

UPDATE 2: ONLY_FULL_GROUP_BY slaakso indicated what is the problem. When ONLY_FULL_GROUP_BY is enabled (it is the default) the engine restricted the use of aggregate function. Nevertheless, it is confusing since this works:

SELECT c.* ,m.atc ,m.start_dt 
     , (SELECT COUNT(*)
        FROM test0.c0 c2
        LEFT OUTER JOIN test0.m0 m2 ON c2.id = m2.id AND c2.cons_dt = m2.start_dt
        WHERE c2.id = c.id
        AND m2.atc <=> m.atc
        AND c.cons_dt > c2.cons_dt
        ) + 1 AS counter
FROM test0.c0 C
LEFT OUTER JOIN test0.m0 m ON c.id = m.id AND c.cons_dt = m.start_dt
ORDER BY m.atc, c.cons_dt, counter
;  

and here is the output:

+----+------------+-----+------------+---------+
| id |  cons_dt   | atc |  start_dt  | counter |
+----+------------+-----+------------+---------+
|  1 | 2000-01-01 | \N  | \N         |       1 |
|  1 | 2000-02-01 | \N  | \N         |       2 |
|  1 | 2000-05-01 | \N  | \N         |       3 |
|  1 | 2000-06-01 | \N  | \N         |       4 |
|  1 | 2000-07-01 | \N  | \N         |       5 |
|  1 | 2000-11-01 | \N  | \N         |       6 |
|  1 | 2000-12-01 | \N  | \N         |       7 |
|  1 | 2000-03-01 | A   | 2000-03-01 |       1 |
|  1 | 2000-04-01 | A   | 2000-04-01 |       2 |
|  1 | 2000-08-01 | A   | 2000-08-01 |       3 |
|  1 | 2000-09-01 | A   | 2000-09-01 |       4 |
|  1 | 2000-08-01 | B   | 2000-08-01 |       1 |
|  1 | 2000-10-01 | B   | 2000-10-01 |       2 |
+----+------------+-----+------------+---------+

Why the first code does not work but the second? What makes these two codes distinct?

UPDATE 3: coming back to my initial problem

My code was wrong. MIN(c.cons_dt) in the subquery is wrong. Correct is MIN(c2.cons_dt). Now it gives the expected results. I suppose because c2.cons_dt belongs to the subquery.

SELECT c.*, m.start_dt, m.atc
      , (SELECT MIN(c2.cons_dt)
         FROM test0.c0 c2
         WHERE c.id = c2.id
               AND c2.cons_dt BETWEEN c.cons_dt + INTERVAL 1 DAY AND c.cons_dt + INTERVAL 30 DAY
         ) AS stop_dt
FROM test0.c0 c
LEFT OUTER JOIN test0.m0 m ON c.id = m.id AND c.cons_dt = m.start_dt
;

Here is an alternative:

SELECT c.*, m.start_dt, m.atc
      , (SELECT c2.cons_dt
         FROM test0.c0 c2
         WHERE c.id = c2.id
               AND c2.cons_dt BETWEEN c.cons_dt + INTERVAL 1 DAY AND c.cons_dt + INTERVAL 30 DAY
         ORDER BY c.id, cons_dt
         LIMIT 1
         ) AS stop_dt
FROM test0.c0 c
LEFT OUTER JOIN test0.m0 m ON c.id = m.id AND c.cons_dt = m.start_dt
;

Putting together: my initial code was wrong. Thanks to this I learnt sql_mode ONLY_FULL_GROUP_BY.


Solution

  • When you mix normal columns with aggregate functions, you need to include the GROUP BY clause.

    Make sure your server has ONLY_FULL_GROUP_BY mode set as it will catch many of the incorrect queries.