Search code examples
mysqlsqldatabasewindow-functionsgaps-and-islands

MySQL combine multiple rows into one based on matching dates


Sample of my data:

+----+------------+------------+-----------+--------------+--+
| ID | startdate  |  enddate   |  status   | lengthofstay |  |
+----+------------+------------+-----------+--------------+--+
|  1 | 2018-02-15 | 2018-02-16 | transfer  |            1 |  |
|  1 | 2018-02-16 | 2018-02-22 | discharge |            6 |  |
|  2 | 2018-03-05 | 2018-03-08 | discharge |            3 |  |
|  1 | 2018-11-01 | 2018-11-03 | transfer  |            2 |  |
|  1 | 2018-11-03 | 2018-11-20 | transfer  |           17 |  |
|  1 | 2018-11-20 | 2018-11-21 | discharge |            1 |  |
|  2 | 2019-05-15 | 2019-05-20 | discharge |            5 |  |
+----+------------+------------+-----------+--------------+--+

There are two problems I would like to address. First, I would like to combine rows with matching IDs into a single row based on matching startdate and enddate, while also taking the sum of the lengthofstay (datediff between startdate and enddate). Second problem, I have repeating IDs, which entered the system at a later date on a separate occasion, that I would like to keep as a separate observation.

Here is what my ideal output would look like:

+----+------------+------------+-----------+--------------+
| ID | startdate  |  enddate   |  status   | lengthofstay |
+----+------------+------------+-----------+--------------+
|  1 | 2018-02-15 | 2018-02-22 | discharge |            7 |
|  2 | 2018-03-05 | 2018-03-08 | discharge |            3 |
|  1 | 2018-11-01 | 2018-11-21 | discharge |           20 |
|  2 | 2019-05-15 | 2019-05-20 | discharge |            5 |
+----+------------+------------+-----------+--------------+

I don't have a lot of experience in MySQL and I am not sure if this is possible to do with a join, concat or group by with rollup. I understand having repeating IDs on separate occasion is an additional problem, so I am thinking of involving another identifier based on how far apart startdate is from the last enddate for each unique ID (margin of 3 days for example), but I don't know how to do that either.

I found a similar question here but with no answer.

I appreciate any insights!


Solution

  • This is a gaps-and-island problem. Here is one way to solve it using window functions, available in MySQL 8.0:

    select
        id,
        min(startdate) startdate,
        max(enddate) enddate,
        last_status status,
        sum(lengthofstay) lengthofstay
    from (
        select
            t.*,
            last_value(status) over(partition by id, rn1 - rn2) last_status
        from (
            select
                t.*,
                row_number() over(order by startdate) rn1,
                row_number() over(partition by id order by startdate) rn2
            from mytable t
        ) t
    ) t
    group by
        id,
        last_status,
        rn1 - rn2
    order by min(startdate)
    

    The query works by ranking the records over two different partitions; the difference between the ranks gives you the group it belongs to. Then, last_value() can be used to retrieve the last status in each group. The last step is aggregation.

    Demo on DB Fiddle:

    id | startdate  | enddate    | status    | lengthofstay
    -: | :--------- | :--------- | :-------- | -----------:
     1 | 2018-02-15 | 2018-02-22 | discharge |            7
     2 | 2018-03-05 | 2018-03-08 | discharge |            3
     1 | 2018-11-01 | 2018-11-21 | discharge |           20
     2 | 2019-05-15 | 2019-05-20 | discharge |            5