I'm looking for a way to do a non-equi join in SQL, joining by whether col x from table A is within a range of dates given in col y from table B. However, table B has multiple possible ranges per ID, and the table's data format is long. For example:
# Table A
# | person_id | breakfast_date | fruit_eaten_for_breakfast |
# |-----------|----------------|---------------------------|
# | 1 | 2023-03-12 | banana |
# | 1 | 2023-03-25 | apple |
# | 1 | 2023-04-01 | orange |
# | 1 | 2023-04-05 | kiwi |
# | 1 | 2023-04-22 | grapefruit |
# | 2 | 2024-12-15 | strawberry |
# | 2 | 2024-01-11 | blueberry |
# | 2 | 2024-02-12 | mango |
# | 2 | 2024-02-29 | watermelon |
# | 2 | 2024-03-10 | pear |
# Table B
# | person_id | period_start_and_end | period |
# |-----------|----------------------|--------|
# | 1 | 2023-03-15 | 1 | first period for user_id = 1 started
# | 1 | 2023-03-30 | 1 | on March 15 and ended on March 30.
# | 1 | 2023-04-02 | 2 |
# | 1 | 2023-04-10 | 2 |
# | 1 | 2023-04-12 | 3 |
# | 1 | 2023-04-20 | 3 |
# | 2 | 2024-01-01 | 1 |
# | 2 | 2024-01-05 | 1 |
# | 2 | 2024-02-10 | 2 | second period for user_id = 2 started
# | 2 | 2024-02-13 | 2 | on Feb 10 and ended on Feb 13.
Note about table B 👆: each person_id
could have one or many period
s, and we cannot know – i.e., we're agnostic to – how many periods per person, when writing the SQL query.
# | person_id | breakfast_date | fruit_eaten_for_breakfast | period |
# |-----------|----------------|---------------------------|--------|
# | 1 | 2023-03-25 | apple | 1 |
# | 1 | 2023-04-05 | kiwi | 2 |
# | 2 | 2024-02-12 | mango | 2 |
I work on AWS Athena that is based on Trino SQL.
WITH
table_a AS (
SELECT * FROM (VALUES
(1, DATE('2023-03-12'), 'banana'),
(1, DATE('2023-03-25'), 'apple'),
(1, DATE('2023-04-01'), 'orange'),
(1, DATE('2023-04-05'), 'kiwi'),
(1, DATE('2023-04-22'), 'grapefruit'),
(2, DATE('2024-12-15'), 'strawberry'),
(2, DATE('2024-01-11'), 'blueberry'),
(2, DATE('2024-02-12'), 'mango'),
(2, DATE('2024-02-29'), 'watermelon'),
(2, DATE('2024-03-10'), 'pear')
) AS t(person_id, breakfast_date, fruit_eaten_for_breakfast)
),
table_b AS (
SELECT * FROM (VALUES
(1, DATE('2023-03-15'), 1),
(1, DATE('2023-03-30'), 1),
(1, DATE('2023-04-02'), 2),
(1, DATE('2023-04-10'), 2),
(1, DATE('2023-04-12'), 3),
(1, DATE('2023-04-20'), 3),
(2, DATE('2024-01-01'), 1),
(2, DATE('2024-01-05'), 1),
(2, DATE('2024-02-10'), 2),
(2, DATE('2024-02-13'), 2)
) AS t(person_id, period_start_and_end, period)
)
Well, not much. The regular non-equi join procedure I'm familiar with has the start date in one column and the end date in another column. But in the current case, we have multiple periods per person_id
, and more problematic – we don't know how many. So even if I "pivot" the table to wide format, I still don't know how to figure out the multiple and unknown periods per person_id
.
You can pivot table_b to get the time ranges, that you need to join to table_a
WITH
table_a AS (
SELECT * FROM (VALUES
(1, DATE('2023-03-12'), 'banana'),
(1, DATE('2023-03-25'), 'apple'),
(1, DATE('2023-04-01'), 'orange'),
(1, DATE('2023-04-05'), 'kiwi'),
(1, DATE('2023-04-22'), 'grapefruit'),
(2, DATE('2024-12-15'), 'strawberry'),
(2, DATE('2024-01-11'), 'blueberry'),
(2, DATE('2024-02-12'), 'mango'),
(2, DATE('2024-02-29'), 'watermelon'),
(2, DATE('2024-03-10'), 'pear')
) AS t(person_id, breakfast_date, fruit_eaten_for_breakfast)
),
table_b AS (
SELECT * FROM (VALUES
(1, DATE('2023-03-15'), 1),
(1, DATE('2023-03-30'), 1),
(1, DATE('2023-04-02'), 2),
(1, DATE('2023-04-10'), 2),
(1, DATE('2023-04-12'), 3),
(1, DATE('2023-04-20'), 3),
(2, DATE('2024-01-01'), 1),
(2, DATE('2024-01-05'), 1),
(2, DATE('2024-02-10'), 2),
(2, DATE('2024-02-13'), 2)
) AS t(person_id, period_start_and_end, period)
), table_b_pivot as (
SELECT person_id,MIN(period_start_and_end) from_date, MAX(period_start_and_end) as To_date, period FROM table_b
GROUP BY person_id, period)
SELECT table_a.person_id, breakfast_date, fruit_eaten_for_breakfast, period
FROM table_a JOIN table_b_pivot
ON table_a.breakfast_date BETWEEN table_b_pivot.from_date AND table_b_pivot.To_date
AND table_a.person_id = table_b_pivot.person_id
ORDER BY table_a.person_id,period
person_id | breakfast_date | fruit_eaten_for_breakfast | period |
---|---|---|---|
1 | 2023-03-25 | apple | 1 |
1 | 2023-04-05 | kiwi | 2 |
2 | 2024-02-12 | mango | 2 |