Sample of my data:
+----+------------+------------+-----------+--------------+--+
| ID | startdate | enddate | status | lengthofstay | |
+----+------------+------------+-----------+--------------+--+
| 1 | 2018-02-15 | 2018-02-16 | transfer | 1 | |
| 1 | 2018-02-16 | 2018-02-22 | discharge | 6 | |
| 2 | 2018-03-05 | 2018-03-08 | discharge | 3 | |
| 1 | 2018-11-01 | 2018-11-03 | transfer | 2 | |
| 1 | 2018-11-03 | 2018-11-20 | transfer | 17 | |
| 1 | 2018-11-20 | 2018-11-21 | discharge | 1 | |
| 2 | 2019-05-15 | 2019-05-20 | discharge | 5 | |
+----+------------+------------+-----------+--------------+--+
There are two problems I would like to address. First, I would like to combine rows with matching IDs into a single row based on matching startdate and enddate, while also taking the sum of the lengthofstay (datediff between startdate and enddate). Second problem, I have repeating IDs, which entered the system at a later date on a separate occasion, that I would like to keep as a separate observation.
Here is what my ideal output would look like:
+----+------------+------------+-----------+--------------+
| ID | startdate | enddate | status | lengthofstay |
+----+------------+------------+-----------+--------------+
| 1 | 2018-02-15 | 2018-02-22 | discharge | 7 |
| 2 | 2018-03-05 | 2018-03-08 | discharge | 3 |
| 1 | 2018-11-01 | 2018-11-21 | discharge | 20 |
| 2 | 2019-05-15 | 2019-05-20 | discharge | 5 |
+----+------------+------------+-----------+--------------+
I don't have a lot of experience in MySQL and I am not sure if this is possible to do with a join, concat or group by with rollup. I understand having repeating IDs on separate occasion is an additional problem, so I am thinking of involving another identifier based on how far apart startdate is from the last enddate for each unique ID (margin of 3 days for example), but I don't know how to do that either.
I found a similar question here but with no answer.
I appreciate any insights!
This is a gaps-and-island problem. Here is one way to solve it using window functions, available in MySQL 8.0:
select
id,
min(startdate) startdate,
max(enddate) enddate,
last_status status,
sum(lengthofstay) lengthofstay
from (
select
t.*,
last_value(status) over(partition by id, rn1 - rn2) last_status
from (
select
t.*,
row_number() over(order by startdate) rn1,
row_number() over(partition by id order by startdate) rn2
from mytable t
) t
) t
group by
id,
last_status,
rn1 - rn2
order by min(startdate)
The query works by ranking the records over two different partitions; the difference between the ranks gives you the group it belongs to. Then, last_value()
can be used to retrieve the last status in each group. The last step is aggregation.
id | startdate | enddate | status | lengthofstay -: | :--------- | :--------- | :-------- | -----------: 1 | 2018-02-15 | 2018-02-22 | discharge | 7 2 | 2018-03-05 | 2018-03-08 | discharge | 3 1 | 2018-11-01 | 2018-11-21 | discharge | 20 2 | 2019-05-15 | 2019-05-20 | discharge | 5