I have two tables c0
and m0
which I merge using a left join. The result is a table with 13 rows. I would like to add the first date of the next 40 days windows. Here are the data:
DROP TABLE IF EXISTS test0.c0;
CREATE TABLE test0.c0 (
id INTEGER
, cons_dt DATE
);
INSERT INTO test0.c0 VALUES
('1','2000-01-01')
,('1','2000-02-01')
,('1','2000-03-01')
,('1','2000-04-01')
,('1','2000-05-01')
,('1','2000-06-01')
,('1','2000-07-01')
,('1','2000-08-01')
,('1','2000-09-01')
,('1','2000-10-01')
,('1','2000-11-01')
,('1','2000-12-01')
;
DROP TABLE IF EXISTS test0.m0;
CREATE TABLE test0.m0 (
id INTEGER
, start_dt DATE
, atc CHAR(1)
);
INSERT INTO test0.m0 VALUES
('1','2000-03-01','A')
,('1','2000-04-01','A')
,('1','2000-08-01','A')
,('1','2000-08-01','B')
,('1','2000-09-01','A')
,('1','2000-10-01','B')
;
And here is the code:
SELECT c.*, m.start_dt, m.atc
, (SELECT MIN(c.cons_dt)
FROM test0.c0 c2
WHERE c2.id = c.id
AND c2.cons_dt BETWEEN c.cons_dt + INTERVAL 1 DAY AND c.cons_dt + INTERVAL 40 DAY
) AS stop_dt
FROM test0.c0 c
LEFT OUTER JOIN test0.m0 m ON c.id = m.id AND c.cons_dt = m.start_dt
This is the output:
+----+------------+------------+-----+---------+
| id | cons_dt | start_dt | atc | stop_dt |
+----+------------+------------+-----+---------+
| 1 | 2000-03-01 | 2000-03-01 | A | \N |
+----+------------+------------+-----+---------+
I expected to get 13 rows with a stop date or without if the condition is not fullfilled. What's wrong with the code? I know I could use windows function but this works not for this kind of data. Consider that this is a simplified dataset. The original has many ids and various atcs.
UPDATE 1: Here are present a code which gives the correct result:
SELECT c.*, m.start_dt, m.atc, MIN(c2.cons_dt) AS stop_dt
FROM test0.c0 c
LEFT OUTER JOIN test0.m0 m ON c.id = m.id AND c.cons_dt = m.start_dt
LEFT JOIN test0.c0 c2 ON c.id = c2.id
AND c2.cons_dt BETWEEN c.cons_dt + INTERVAL 1 DAY AND c.cons_dt + INTERVAL 30 DAY
GROUP BY c.id, c.cons_dt, m.start_dt, m.atc
;
and here is the table I get and I expected:
+----+------------+------------+-----+------------+
| id | cons_dt | start_dt | atc | stop_dt |
+----+------------+------------+-----+------------+
| 1 | 2000-01-01 | \N | \N | \N |
| 1 | 2000-02-01 | \N | \N | 2000-03-01 |
| 1 | 2000-03-01 | 2000-03-01 | A | \N |
| 1 | 2000-04-01 | 2000-04-01 | A | 2000-05-01 |
| 1 | 2000-05-01 | \N | \N | \N |
| 1 | 2000-06-01 | \N | \N | 2000-07-01 |
| 1 | 2000-07-01 | \N | \N | \N |
| 1 | 2000-08-01 | 2000-08-01 | A | \N |
| 1 | 2000-08-01 | 2000-08-01 | B | \N |
| 1 | 2000-09-01 | 2000-09-01 | A | 2000-10-01 |
| 1 | 2000-10-01 | 2000-10-01 | B | \N |
| 1 | 2000-11-01 | \N | \N | 2000-12-01 |
| 1 | 2000-12-01 | \N | \N | \N |
+----+------------+------------+-----+------------+
I just want to understand why the subquery does not work. An interesting question would be also which performs better.
UPDATE 2: ONLY_FULL_GROUP_BY slaakso indicated what is the problem. When ONLY_FULL_GROUP_BY is enabled (it is the default) the engine restricted the use of aggregate function. Nevertheless, it is confusing since this works:
SELECT c.* ,m.atc ,m.start_dt
, (SELECT COUNT(*)
FROM test0.c0 c2
LEFT OUTER JOIN test0.m0 m2 ON c2.id = m2.id AND c2.cons_dt = m2.start_dt
WHERE c2.id = c.id
AND m2.atc <=> m.atc
AND c.cons_dt > c2.cons_dt
) + 1 AS counter
FROM test0.c0 C
LEFT OUTER JOIN test0.m0 m ON c.id = m.id AND c.cons_dt = m.start_dt
ORDER BY m.atc, c.cons_dt, counter
;
and here is the output:
+----+------------+-----+------------+---------+
| id | cons_dt | atc | start_dt | counter |
+----+------------+-----+------------+---------+
| 1 | 2000-01-01 | \N | \N | 1 |
| 1 | 2000-02-01 | \N | \N | 2 |
| 1 | 2000-05-01 | \N | \N | 3 |
| 1 | 2000-06-01 | \N | \N | 4 |
| 1 | 2000-07-01 | \N | \N | 5 |
| 1 | 2000-11-01 | \N | \N | 6 |
| 1 | 2000-12-01 | \N | \N | 7 |
| 1 | 2000-03-01 | A | 2000-03-01 | 1 |
| 1 | 2000-04-01 | A | 2000-04-01 | 2 |
| 1 | 2000-08-01 | A | 2000-08-01 | 3 |
| 1 | 2000-09-01 | A | 2000-09-01 | 4 |
| 1 | 2000-08-01 | B | 2000-08-01 | 1 |
| 1 | 2000-10-01 | B | 2000-10-01 | 2 |
+----+------------+-----+------------+---------+
Why the first code does not work but the second? What makes these two codes distinct?
UPDATE 3: coming back to my initial problem
My code was wrong. MIN(c.cons_dt) in the subquery is wrong. Correct is MIN(c2.cons_dt). Now it gives the expected results. I suppose because c2.cons_dt belongs to the subquery.
SELECT c.*, m.start_dt, m.atc
, (SELECT MIN(c2.cons_dt)
FROM test0.c0 c2
WHERE c.id = c2.id
AND c2.cons_dt BETWEEN c.cons_dt + INTERVAL 1 DAY AND c.cons_dt + INTERVAL 30 DAY
) AS stop_dt
FROM test0.c0 c
LEFT OUTER JOIN test0.m0 m ON c.id = m.id AND c.cons_dt = m.start_dt
;
Here is an alternative:
SELECT c.*, m.start_dt, m.atc
, (SELECT c2.cons_dt
FROM test0.c0 c2
WHERE c.id = c2.id
AND c2.cons_dt BETWEEN c.cons_dt + INTERVAL 1 DAY AND c.cons_dt + INTERVAL 30 DAY
ORDER BY c.id, cons_dt
LIMIT 1
) AS stop_dt
FROM test0.c0 c
LEFT OUTER JOIN test0.m0 m ON c.id = m.id AND c.cons_dt = m.start_dt
;
Putting together: my initial code was wrong. Thanks to this I learnt sql_mode ONLY_FULL_GROUP_BY
.
When you mix normal columns with aggregate functions, you need to include the GROUP BY
clause.
Make sure your server has ONLY_FULL_GROUP_BY
mode set as it will catch many of the incorrect queries.