I have two tables which consist of the following columns:
Table1:
ID
N
which describes an ID
who did something, e.g. for the first time, then a second time, etc.
START_DATE
is related to column N
. It describes the date when an ID
did something.
STOP_DATE
1 day before column GROUPS
changes from 0 to 1.
GROUPS
is a grouping variable.
Table2:
ID
EVENT_HAPPENED
is a date on which an event happened.
Table1:
ID | N | START_DATE | STOP_DATE | GROUPS |
---|---|---|---|---|
1 | 1 | 2023-01-26 | 2023-04-13 | 0 |
1 | 2 | 2023-02-05 | 2023-04-13 | 0 |
1 | 3 | 2023-02-27 | 2023-04-13 | 0 |
1 | 4 | 2023-04-14 | 2023-09-13 | 1 |
1 | 5 | 2023-04-14 | 2023-09-13 | 1 |
1 | 6 | 2023-04-14 | 2023-09-13 | 1 |
Table2:
ID | EVENT_HAPPENED |
---|---|
1 | 2023-03-12 |
1 | 2023-03-18 |
1 | 2023-04-07 |
1 | 2023-05-31 |
So what I need to do is to check the minimum date of EVENT_HAPPENED
. In this case, it is 2023-03-12 and then I reference it to table1 to create a column called MARKER
which indicates with 1 that for all subsequent data entries for this ID, the event has already happened.
I also need to associate the smallest date on which the EVENT_HAPPENED
occurred with the corresponding interval between START_DATE
and STOP_DATE
as a new column called HAPPENED_FIRST_TIME
with the corresponding date as column HAPPENED_ON
.
What I expect
ID | N | START_DATE | STOP_DATE | GROUPS | MARKER | HAPPENED_FIRST_TIME | HAPPENED_ON |
---|---|---|---|---|---|---|---|
1 | 1 | 2023-01-26 | 2023-04-13 | 0 | 0 | 0 | null |
1 | 2 | 2023-02-05 | 2023-04-13 | 0 | 0 | 0 | null |
1 | 3 | 2023-02-27 | 2023-04-13 | 0 | 0 | 1 | 2023-03-12 |
1 | 4 | 2023-04-14 | 2023-09-13 | 1 | 1 | 0 | null |
1 | 5 | 2023-04-14 | 2023-09-13 | 1 | 1 | 0 | null |
1 | 6 | 2023-04-14 | 2023-09-13 | 1 | 1 | 0 | null |
What I tried to do is the following:
WITH EventDates AS (
SELECT
ID,
MIN(EVENT_HAPPENED) AS MIN_EVENT_DATE
FROM
Table2
GROUP BY
ID
),
IntervalDates AS (
SELECT
t1.ID,
t1.N,
t1.START_DATE,
t1.STOP_DATE,
t1.GROUPS,
ed.MIN_EVENT_DATE AS HAPPENED_FIRST_TIME
FROM
Table1 t1
LEFT JOIN
EventDates ed ON t1.ID = ed.ID
)
SELECT
ID,
N,
START_DATE,
STOP_DATE,
GROUPS,
CASE
WHEN HAPPENED_FIRST_TIME <= START_DATE THEN 1
ELSE 0
END AS MARKER,
CASE
WHEN HAPPENED_FIRST_TIME BETWEEN START_DATE AND STOP_DATE THEN 1
ELSE 0
END AS HAPPENED_FIRST_TIME,
CASE
WHEN HAPPENED_FIRST_TIME BETWEEN START_DATE AND STOP_DATE THEN HAPPENED_FIRST_TIME
ELSE NULL
END AS HAPPENED_ON
FROM
IntervalDates;
The problem occurs here because the smallest EVENT_HAPPENED
date is actually between START_DATE
and STOP_DATE
for the first 3 data entries. However, it happened after START_DATE
2023-02-27:
CASE
WHEN HAPPENED_FIRST_TIME BETWEEN START_DATE AND STOP_DATE THEN 1
ELSE 0
END AS HAPPENED_FIRST_TIME,
CASE
WHEN HAPPENED_FIRST_TIME BETWEEN START_DATE AND STOP_DATE THEN HAPPENED_FIRST_TIME
ELSE NULL
END AS HAPPENED_ON
which gives the following result:
ID | N | START_DATE | STOP_DATE | GROUPS | MARKER | HAPPENED_FIRST_TIME | HAPPENED_ON |
---|---|---|---|---|---|---|---|
1 | 1 | 2023-01-26 | 2023-04-13 | 0 | 0 | 1 | 2023-03-12 |
1 | 2 | 2023-02-05 | 2023-04-13 | 0 | 0 | 1 | 2023-03-12 |
1 | 3 | 2023-02-27 | 2023-04-13 | 0 | 0 | 1 | 2023-03-12 |
1 | 4 | 2023-04-14 | 2023-09-13 | 1 | 1 | 0 | null |
1 | 5 | 2023-04-14 | 2023-09-13 | 1 | 1 | 0 | null |
1 | 6 | 2023-04-14 | 2023-09-13 | 1 | 1 | 0 | null |
How can I adjust my code?
Thank you
------------------EDIT
When adding ID
s who ever belong to GROUPS
= 0 the MARKER
column works fine, but the HAPPENED_FIRST_TIME
indicator does not work nor the corresponding HAPPENED_ON
.
Do I need to split ID
s into those who belong to GROUPS
= 1 and those who never do to run the RN?
Expected output:
ID | N | START_DATE | STOP_DATE | GROUPS | MARKER | HAPPENED_FIRST_TIME | HAPPENED_ON |
---|---|---|---|---|---|---|---|
1 | 1 | 2023-01-26 | 2023-04-13 | 0 | 0 | 0 | null |
1 | 2 | 2023-02-05 | 2023-04-13 | 0 | 0 | 0 | null |
1 | 3 | 2023-02-27 | 2023-04-13 | 0 | 0 | 1 | 2023-03-12 |
1 | 4 | 2023-04-14 | 2023-09-13 | 1 | 1 | 0 | null |
1 | 5 | 2023-04-14 | 2023-09-13 | 1 | 1 | 0 | null |
1 | 6 | 2023-04-14 | 2023-09-13 | 1 | 1 | 0 | null |
3 | 1 | 2023-01-26 | 2023-04-13 | 0 | 0 | 1 | 2023-01-31 |
3 | 2 | 2023-02-05 | 2023-04-13 | 0 | 1 | 0 | null |
3 | 3 | 2023-02-27 | 2023-04-13 | 0 | 1 | 0 | null |
3 | 4 | 2023-04-14 | 2023-04-13 | 0 | 1 | 0 | null |
I have slightly updated your query to retrieve the correct HAPPENED_FIRST_TIME, I used the window function row_number()
to pinpoint the nearest START_DATE relative to the event date :
WITH EventDates AS (
SELECT
ID,
MIN(EVENT_HAPPENED) AS MIN_EVENT_DATE
FROM
Table2
GROUP BY
ID
),
IntervalDates AS (
SELECT
t1.ID,
t1.N,
t1.START_DATE,
t1.STOP_DATE,
t1.GROUPS,
ed.MIN_EVENT_DATE AS HAPPENED_FIRST_TIME,
ROW_NUMBER() OVER (PARTITION BY t1.ID ORDER BY t1.N DESC)
- ROW_NUMBER() OVER (PARTITION BY t1.ID, t1.GROUPS ORDER BY t1.N DESC) AS grp
FROM
Table1 t1
LEFT JOIN
EventDates ed ON t1.ID = ed.ID
),
closest_dates AS (
SELECT ID, N, ROW_NUMBER() OVER (PARTITION BY ID, grp ORDER BY N DESC) AS RN
FROM IntervalDates
WHERE HAPPENED_FIRST_TIME BETWEEN START_DATE AND STOP_DATE
)
SELECT i.ID,
i.N,
i.START_DATE,
i.STOP_DATE,
i.GROUPS,
CASE
WHEN i.HAPPENED_FIRST_TIME <= i.START_DATE THEN 1
ELSE 0
END AS MARKER,
CASE
WHEN RN = 1 AND c.ID IS NOT NULL THEN 1
ELSE 0
END AS HAPPENED_FIRST_TIME,
CASE
WHEN RN = 1 AND c.ID IS NOT NULL THEN HAPPENED_FIRST_TIME
ELSE NULL
END AS HAPPENED_ON
FROM IntervalDates i
LEFT JOIN closest_dates c ON i.ID = c.ID and i.N = c.N
Results :
ID N START_DATE STOP_DATE GROUPS MARKER FIRST_TIME HAPPENED_ON
1 1 2023-01-26 2023-04-13 0 0 0 null
1 2 2023-02-05 2023-04-13 0 0 0 null
1 3 2023-02-27 2023-04-13 0 0 1 2023-03-12
1 4 2023-04-14 2023-09-13 1 1 0 null
1 5 2023-04-14 2023-09-13 1 1 0 null
1 6 2023-04-14 2023-09-13 1 1 0 null
Based on the recent simple data you provided in the comments, it appears that the question has transformed into a "gaps and islands" problem . This can be resolved using the difference between two row_numbers to identify the existing groups :
WITH EventDates AS (
SELECT ID, MIN(EVENT_HAPPENED) AS MIN_EVENT_DATE
FROM Table2
GROUP BY ID
),
IntervalDates AS (
SELECT t1.ID, t1.N, t1.START_DATE, t1.STOP_DATE, t1.GROUPS,
ed.MIN_EVENT_DATE AS HAPPENED_FIRST_TIME,
ROW_NUMBER() OVER (PARTITION BY t1.ID ORDER BY t1.N DESC)
- ROW_NUMBER() OVER (PARTITION BY t1.ID, t1.GROUPS ORDER BY t1.N DESC) AS grp
FROM Table1 t1
LEFT JOIN EventDates ed ON t1.ID = ed.ID
),
closest_dates AS (
SELECT ID, N, ROW_NUMBER() OVER (PARTITION BY ID, grp ORDER BY N DESC) AS RN
FROM IntervalDates
WHERE HAPPENED_FIRST_TIME BETWEEN START_DATE AND STOP_DATE
)
SELECT i.ID, i.N, i.START_DATE, i.STOP_DATE, i.GROUPS,
CASE
WHEN i.HAPPENED_FIRST_TIME <= i.START_DATE THEN 1
ELSE 0
END AS MARKER,
CASE
WHEN RN = 1 AND c.ID IS NOT NULL THEN 1
ELSE 0
END AS HAPPENED_FIRST_TIME,
CASE
WHEN RN = 1 AND c.ID IS NOT NULL THEN HAPPENED_FIRST_TIME
ELSE NULL
END AS HAPPENED_ON
FROM IntervalDates i
LEFT JOIN closest_dates c ON i.ID = c.ID and i.N = c.N