Search code examples
sqlsql-servergaps-and-islands

Create dummy column that indicates when something happened between an interval


I have two tables which consist of the following columns:

Table1:

ID
N which describes an ID who did something, e.g. for the first time, then a second time, etc.
START_DATE is related to column N. It describes the date when an ID did something.
STOP_DATE 1 day before column GROUPS changes from 0 to 1.
GROUPS is a grouping variable.

Table2:

ID
EVENT_HAPPENED is a date on which an event happened.

db<>fiddle

Table1:

ID N START_DATE STOP_DATE GROUPS
1 1 2023-01-26 2023-04-13 0
1 2 2023-02-05 2023-04-13 0
1 3 2023-02-27 2023-04-13 0
1 4 2023-04-14 2023-09-13 1
1 5 2023-04-14 2023-09-13 1
1 6 2023-04-14 2023-09-13 1

Table2:

ID EVENT_HAPPENED
1 2023-03-12
1 2023-03-18
1 2023-04-07
1 2023-05-31

So what I need to do is to check the minimum date of EVENT_HAPPENED. In this case, it is 2023-03-12 and then I reference it to table1 to create a column called MARKER which indicates with 1 that for all subsequent data entries for this ID, the event has already happened. I also need to associate the smallest date on which the EVENT_HAPPENED occurred with the corresponding interval between START_DATE and STOP_DATE as a new column called HAPPENED_FIRST_TIME with the corresponding date as column HAPPENED_ON.

What I expect

ID N START_DATE STOP_DATE GROUPS MARKER HAPPENED_FIRST_TIME HAPPENED_ON
1 1 2023-01-26 2023-04-13 0 0 0 null
1 2 2023-02-05 2023-04-13 0 0 0 null
1 3 2023-02-27 2023-04-13 0 0 1 2023-03-12
1 4 2023-04-14 2023-09-13 1 1 0 null
1 5 2023-04-14 2023-09-13 1 1 0 null
1 6 2023-04-14 2023-09-13 1 1 0 null

What I tried to do is the following:

WITH EventDates AS (
    SELECT
        ID,
        MIN(EVENT_HAPPENED) AS MIN_EVENT_DATE
    FROM
        Table2
    GROUP BY
        ID
),
IntervalDates AS (
    SELECT
        t1.ID,
        t1.N,
        t1.START_DATE,
        t1.STOP_DATE,
        t1.GROUPS,
        ed.MIN_EVENT_DATE AS HAPPENED_FIRST_TIME
    FROM
        Table1 t1
    LEFT JOIN
        EventDates ed ON t1.ID = ed.ID
)
SELECT
    ID,
    N,
    START_DATE,
    STOP_DATE,
    GROUPS,
    CASE
        WHEN HAPPENED_FIRST_TIME <= START_DATE THEN 1
        ELSE 0
    END AS MARKER,
    CASE
        WHEN HAPPENED_FIRST_TIME BETWEEN START_DATE AND STOP_DATE THEN 1
        ELSE 0
    END AS HAPPENED_FIRST_TIME,
    CASE
        WHEN HAPPENED_FIRST_TIME BETWEEN START_DATE AND STOP_DATE THEN HAPPENED_FIRST_TIME
        ELSE NULL
    END AS HAPPENED_ON
FROM
    IntervalDates;

The problem occurs here because the smallest EVENT_HAPPENED date is actually between START_DATE and STOP_DATE for the first 3 data entries. However, it happened after START_DATE 2023-02-27:

    CASE
        WHEN HAPPENED_FIRST_TIME BETWEEN START_DATE AND STOP_DATE THEN 1
        ELSE 0
    END AS HAPPENED_FIRST_TIME,
    CASE
        WHEN HAPPENED_FIRST_TIME BETWEEN START_DATE AND STOP_DATE THEN HAPPENED_FIRST_TIME
        ELSE NULL
    END AS HAPPENED_ON

which gives the following result:

ID N START_DATE STOP_DATE GROUPS MARKER HAPPENED_FIRST_TIME HAPPENED_ON
1 1 2023-01-26 2023-04-13 0 0 1 2023-03-12
1 2 2023-02-05 2023-04-13 0 0 1 2023-03-12
1 3 2023-02-27 2023-04-13 0 0 1 2023-03-12
1 4 2023-04-14 2023-09-13 1 1 0 null
1 5 2023-04-14 2023-09-13 1 1 0 null
1 6 2023-04-14 2023-09-13 1 1 0 null

How can I adjust my code?

Thank you

------------------EDIT
When adding IDs who ever belong to GROUPS = 0 the MARKER column works fine, but the HAPPENED_FIRST_TIME indicator does not work nor the corresponding HAPPENED_ON.

Do I need to split IDs into those who belong to GROUPS = 1 and those who never do to run the RN?

Expected output:

ID N START_DATE STOP_DATE GROUPS MARKER HAPPENED_FIRST_TIME HAPPENED_ON
1 1 2023-01-26 2023-04-13 0 0 0 null
1 2 2023-02-05 2023-04-13 0 0 0 null
1 3 2023-02-27 2023-04-13 0 0 1 2023-03-12
1 4 2023-04-14 2023-09-13 1 1 0 null
1 5 2023-04-14 2023-09-13 1 1 0 null
1 6 2023-04-14 2023-09-13 1 1 0 null
3 1 2023-01-26 2023-04-13 0 0 1 2023-01-31
3 2 2023-02-05 2023-04-13 0 1 0 null
3 3 2023-02-27 2023-04-13 0 1 0 null
3 4 2023-04-14 2023-04-13 0 1 0 null

Solution

  • I have slightly updated your query to retrieve the correct HAPPENED_FIRST_TIME, I used the window function row_number() to pinpoint the nearest START_DATE relative to the event date :

    WITH EventDates AS (
        SELECT
            ID,
            MIN(EVENT_HAPPENED) AS MIN_EVENT_DATE
        FROM
            Table2
        GROUP BY
            ID
    ),
    IntervalDates AS (
        SELECT
            t1.ID,
            t1.N,
            t1.START_DATE,
            t1.STOP_DATE,
            t1.GROUPS,
            ed.MIN_EVENT_DATE AS HAPPENED_FIRST_TIME,
            ROW_NUMBER() OVER (PARTITION BY t1.ID ORDER BY t1.N DESC) 
            - ROW_NUMBER() OVER (PARTITION BY t1.ID, t1.GROUPS ORDER BY t1.N DESC) AS grp
        FROM
            Table1 t1
        LEFT JOIN
            EventDates ed ON t1.ID = ed.ID
    ),
    closest_dates AS (
      SELECT ID, N, ROW_NUMBER() OVER (PARTITION BY ID, grp ORDER BY N DESC) AS RN
      FROM IntervalDates
      WHERE HAPPENED_FIRST_TIME BETWEEN START_DATE AND STOP_DATE
    )
    SELECT i.ID,
           i.N,
           i.START_DATE,
           i.STOP_DATE,
           i.GROUPS,
           CASE
              WHEN i.HAPPENED_FIRST_TIME <= i.START_DATE THEN 1
              ELSE 0
           END AS MARKER,
           CASE
             WHEN RN = 1 AND c.ID IS NOT NULL THEN 1
             ELSE 0
           END AS HAPPENED_FIRST_TIME,
           CASE
              WHEN RN = 1 AND c.ID IS NOT NULL THEN HAPPENED_FIRST_TIME
              ELSE NULL
           END AS HAPPENED_ON
    FROM IntervalDates i
    LEFT JOIN closest_dates c ON i.ID = c.ID and i.N = c.N 
    

    Results :

    ID  N   START_DATE  STOP_DATE   GROUPS  MARKER  FIRST_TIME  HAPPENED_ON
    1   1   2023-01-26  2023-04-13  0       0       0           null
    1   2   2023-02-05  2023-04-13  0       0       0           null
    1   3   2023-02-27  2023-04-13  0       0       1           2023-03-12
    1   4   2023-04-14  2023-09-13  1       1       0           null
    1   5   2023-04-14  2023-09-13  1       1       0           null
    1   6   2023-04-14  2023-09-13  1       1       0           null
    

    Demo here


    Based on the recent simple data you provided in the comments, it appears that the question has transformed into a "gaps and islands" problem . This can be resolved using the difference between two row_numbers to identify the existing groups :

    WITH EventDates AS (
        SELECT ID, MIN(EVENT_HAPPENED) AS MIN_EVENT_DATE
        FROM Table2
        GROUP BY ID
    ),
    IntervalDates AS (
        SELECT t1.ID, t1.N, t1.START_DATE, t1.STOP_DATE, t1.GROUPS,
            ed.MIN_EVENT_DATE AS HAPPENED_FIRST_TIME,
            ROW_NUMBER() OVER (PARTITION BY t1.ID ORDER BY t1.N DESC) 
            - ROW_NUMBER() OVER (PARTITION BY t1.ID, t1.GROUPS ORDER BY t1.N DESC) AS grp
        FROM Table1 t1
        LEFT JOIN EventDates ed ON t1.ID = ed.ID
    ),
    closest_dates AS (
      SELECT ID, N, ROW_NUMBER() OVER (PARTITION BY ID, grp ORDER BY N DESC) AS RN
      FROM IntervalDates
      WHERE HAPPENED_FIRST_TIME BETWEEN START_DATE AND STOP_DATE
    )
    SELECT i.ID, i.N, i.START_DATE, i.STOP_DATE, i.GROUPS,
           CASE
              WHEN i.HAPPENED_FIRST_TIME <= i.START_DATE THEN 1
              ELSE 0
           END AS MARKER,
           CASE
             WHEN RN = 1 AND c.ID IS NOT NULL THEN 1
             ELSE 0
           END AS HAPPENED_FIRST_TIME,
           CASE
              WHEN RN = 1 AND c.ID IS NOT NULL THEN HAPPENED_FIRST_TIME
              ELSE NULL
           END AS HAPPENED_ON
    FROM IntervalDates i
    LEFT JOIN closest_dates c ON i.ID = c.ID and i.N = c.N 
    

    Demo here