Transform a table based on date overlap


Problem statement

Hi there, I'm a beginner SQL user. I have access to a database for research, and one table of interest is structured as follows:

id date use item
1 2023-02-01 90 A
1 2023-03-01 10 B
1 2023-03-15 15 C
2 2023-02-05 10 B
2 2023-02-13 30 A

Each row records the date on which a personnel id acquired an item and the number of days (use) the item stays in use. I want to transform this table based on how those date ranges overlap for each personnel id, but I'm not quite sure how to achieve that.

Desired output

I imagine the resulting table would be as follows:

id start end item
1 1 29 A
1 30 39 A, B
1 40 43 A
1 44 58 A, C
1 59 90 A
2 1 8 B
2 9 10 A, B
2 11 38 A

Rough idea

My idea here is to:

  1. Convert each date into a day index relative to the first acquisition date of each personnel id, so that the earliest date becomes day 1.

  2. With that index as start, the relative end day of each item is start + use - 1:

    id start use end item
    1 1 90 90 A
    1 30 10 39 B
    1 44 15 58 C
    2 1 10 10 B
    2 9 30 38 A
  3. Visually, the item acquisition journey for each personnel would be:

    Personnel 1:

                                                  +--------------C (15)  
                                   +---------B (10)  
    +------------------------------------------------------------------------------------------A (90)  
    +------------------------------+--------------+-------------------------------------------->  
    1                              30             44                                         90  
    

    Personnel 2:

            +--------------------------------A (30)  
    +---------B (10)  
    +-------+-------------------------------->  
    1       9                              38  
    
  4. Finally, I can evaluate concurrent item acquisition as follows:

    id start end item
    1 1 29 A
    1 30 39 A, B
    1 40 43 A
    1 44 58 A, C
    1 59 90 A
    2 1 8 B
    2 9 10 A, B
    2 11 38 A

What I found

I found some relevant topics to this question, including:

What I tried

-- Create a dummy table
CREATE TABLE tbl (
  id int,
  start_date date,
  day_of_use int,
  item varchar(8)
);

-- Populate the table with sample data
INSERT INTO tbl VALUES (1, '2023-02-01', 90, 'A');
INSERT INTO tbl VALUES (1, '2023-03-02', 10, 'B');
INSERT INTO tbl VALUES (1, '2023-03-15', 15, 'C');
INSERT INTO tbl VALUES (2, '2023-02-05', 10, 'B');
INSERT INTO tbl VALUES (2, '2023-02-13', 30, 'A');

-- Convert each start date to a day index relative to the first date of its id
SELECT
  a.id,
  DATEDIFF(a.start_date, b.init) + 1 AS start,
  a.day_of_use,
  DATEDIFF(a.start_date, b.init) + a.day_of_use AS end,
  a.item
FROM
  tbl AS a
  LEFT JOIN
  (
    -- Earliest acquisition date per id
    SELECT id, MIN(start_date) AS init
    FROM tbl
    GROUP BY id
  ) AS b ON a.id = b.id
;

However, I'm not sure how to proceed from here. Any ideas would be appreciated, thanks in advance. SQL Fiddle: https://www.db-fiddle.com/f/eB1GnZTuuH5P71kr4pWqHu/2


Solution

  • If you're using MySQL 8.0, you could try:

    • unrolling each row into one row per day of use with a recursive CTE
    • treating the result as a gaps-and-islands problem
    • aggregating to gather the items in use over each range of days

    To address the gaps-and-islands part, you can rebuild the partitions with a running count of how many times the set of items in use changes between consecutive days.
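    As a minimal sketch of that running-count step in isolation, assuming MySQL 8.0 and a hypothetical inline list of per-day item sets (the `days`, `flagged`, `numbered`, and `island` names are made up for illustration and are not part of the question's schema):

    ```sql
    -- Hypothetical per-day item lists for a single id (illustration only).
    WITH days AS (
        SELECT 1 AS rn, 'A' AS items UNION ALL
        SELECT 2, 'A'   UNION ALL
        SELECT 3, 'A,B' UNION ALL
        SELECT 4, 'A,B' UNION ALL
        SELECT 5, 'A'
    ), flagged AS (
        -- Put the previous day's item list next to the current one.
        SELECT rn, items,
               LAG(items) OVER (ORDER BY rn) AS prev_items
        FROM days
    ), numbered AS (
        -- Running count of change points: every change starts a new island.
        SELECT rn, items,
               COUNT(CASE WHEN prev_items <> items THEN 1 END)
                   OVER (ORDER BY rn) AS island
        FROM flagged
    )
    SELECT MIN(rn) AS start_, MAX(rn) AS end_, items
    FROM numbered
    GROUP BY island, items
    ORDER BY start_;
    -- start_  end_  items
    -- 1       2     A
    -- 3       4     A,B
    -- 5       5     A
    ```

    Grouping by the running count as well as by the item list is what keeps the two separate runs of A apart; grouping by the item list alone would merge them into a single range from day 1 to day 5.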

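    WITH RECURSIVE cte AS (
        -- Anchor: one row per item on its first day of use
        SELECT id, start_date AS date_, day_of_use, item
        FROM tbl

        UNION ALL

        -- Recursive step: one row per remaining day, so that each item
        -- yields exactly day_of_use rows
        SELECT id, DATE_ADD(date_, INTERVAL 1 DAY), day_of_use - 1, item
        FROM cte
        WHERE day_of_use > 1
    ), cte2 AS (
        -- Number the distinct dates of each id (1 = first day); rn equals the
        -- day index as long as every day in the range has at least one item
        SELECT id,
               DENSE_RANK() OVER(PARTITION BY id ORDER BY date_) AS rn,
               item
        FROM cte
    ), cte3 AS (
        -- Items in use on each day, next to the previous day's items
        SELECT id,
               rn,
               GROUP_CONCAT(item ORDER BY item) AS items,
               LAG(GROUP_CONCAT(item ORDER BY item)) OVER(PARTITION BY id ORDER BY rn) AS prev_items
        FROM cte2
        GROUP BY id, rn
    ), cte4 AS (
        -- Running count of item-set changes: one value per island
        SELECT id, rn, items,
               COUNT(CASE WHEN prev_items != items THEN 1 END) OVER(PARTITION BY id ORDER BY rn) AS parts
        FROM cte3
    )
    SELECT id, MIN(rn) AS start_, MAX(rn) AS end_, items
    FROM cte4
    GROUP BY id, items, parts
    ORDER BY id, start_;

    Output:

    id start_ end_ items
    1 1 29 A
    1 30 39 A,B
    1 40 42 A
    1 43 57 A,C
    1 58 90 A
    2 1 8 B
    2 9 10 A,B
    2 11 38 A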

    Note that February 2023 has 28 days, not 29, so for id 1 the date 2023-03-15 is day 43 rather than day 44 as in the question's desired output.
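
    As a quick check of that note, the day indexes can be computed directly with DATEDIFF. This sketch assumes the start dates from the question's INSERT statements (where B starts on 2023-03-02) and counts 2023-02-01 as day 1; the `b_start` and `c_start` aliases are made up for illustration:

    ```sql
    -- Day index of the B and C start dates for id 1, with 2023-02-01 as day 1.
    SELECT DATEDIFF('2023-03-02', '2023-02-01') + 1 AS b_start,  -- 30
           DATEDIFF('2023-03-15', '2023-02-01') + 1 AS c_start;  -- 43
    ```

    February 1 to February 28 covers days 1 to 28, so March 1 is day 29, March 2 is day 30, and March 15 is day 43.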