Search code examples
sqljoinamazon-athenaprestotrino

How to do a non-equi join in SQL with multiple time ranges per group?


I'm looking for a way to do a non-equi join in SQL, joining by whether col x from table A is within a range of dates given in col y from table B. However, table B has multiple possible ranges per ID, and the table's data format is long. For example:

# Table A
# | person_id | breakfast_date | fruit_eaten_for_breakfast |
# |-----------|----------------|---------------------------|
# | 1         | 2023-03-12     | banana                    |
# | 1         | 2023-03-25     | apple                     |
# | 1         | 2023-04-01     | orange                    |
# | 1         | 2023-04-05     | kiwi                      |
# | 1         | 2023-04-22     | grapefruit                |
# | 2         | 2024-12-15     | strawberry                |
# | 2         | 2024-01-11     | blueberry                 |
# | 2         | 2024-02-12     | mango                     |
# | 2         | 2024-02-29     | watermelon                |
# | 2         | 2024-03-10     | pear                      |


# Table B
# | person_id | period_start_and_end | period |
# |-----------|----------------------|--------|
# | 1         | 2023-03-15           | 1      | first period for user_id = 1 started
# | 1         | 2023-03-30           | 1      | on March 15 and ended on March 30.
# | 1         | 2023-04-02           | 2      |
# | 1         | 2023-04-10           | 2      |
# | 1         | 2023-04-12           | 3      |
# | 1         | 2023-04-20           | 3      |
# | 2         | 2024-01-01           | 1      |
# | 2         | 2024-01-05           | 1      |
# | 2         | 2024-02-10           | 2      | second period for user_id = 2 started
# | 2         | 2024-02-13           | 2      | on Feb 10 and ended on Feb 13.

Note about table B 👆: each person_id could have one or many periods, and we cannot know – i.e., we're agnostic to – how many periods per person, when writing the SQL query.

Expected Output

# | person_id | breakfast_date | fruit_eaten_for_breakfast | period |
# |-----------|----------------|---------------------------|--------|
# | 1         | 2023-03-25     | apple                     | 1      |
# | 1         | 2023-04-05     | kiwi                      | 2      |
# | 2         | 2024-02-12     | mango                     | 2      |

SQL Dialect

I work on AWS Athena that is based on Trino SQL.

Reproducible data

WITH

table_a AS (
  SELECT * FROM (VALUES
    (1, DATE('2023-03-12'), 'banana'),
    (1, DATE('2023-03-25'), 'apple'),
    (1, DATE('2023-04-01'), 'orange'),
    (1, DATE('2023-04-05'), 'kiwi'),
    (1, DATE('2023-04-22'), 'grapefruit'),
    (2, DATE('2024-12-15'), 'strawberry'),
    (2, DATE('2024-01-11'), 'blueberry'),
    (2, DATE('2024-02-12'), 'mango'),
    (2, DATE('2024-02-29'), 'watermelon'),
    (2, DATE('2024-03-10'), 'pear')
  ) AS t(person_id, breakfast_date, fruit_eaten_for_breakfast)
),

table_b AS (
  SELECT * FROM (VALUES
    (1, DATE('2023-03-15'), 1),
    (1, DATE('2023-03-30'), 1),
    (1, DATE('2023-04-02'), 2),
    (1, DATE('2023-04-10'), 2),
    (1, DATE('2023-04-12'), 3),
    (1, DATE('2023-04-20'), 3),
    (2, DATE('2024-01-01'), 1),
    (2, DATE('2024-01-05'), 1),
    (2, DATE('2024-02-10'), 2),
    (2, DATE('2024-02-13'), 2)
  ) AS t(person_id, period_start_and_end, period)
)

What I have tried so far

Well, not much. The regular non-equi join procedure I'm familiar with has the start date in one column and the end date in another column. But in the current case, we have multiple periods per person_id, and more problematic – we don't know how many. So even if I "pivot" the table to wide format, I still don't know how to figure out the multiple and unknown periods per person_id.


Solution

  • You can pivot table_b to get the time ranges, that you need to join to table_a

    WITH
    
    table_a AS (
      SELECT * FROM (VALUES
        (1, DATE('2023-03-12'), 'banana'),
        (1, DATE('2023-03-25'), 'apple'),
        (1, DATE('2023-04-01'), 'orange'),
        (1, DATE('2023-04-05'), 'kiwi'),
        (1, DATE('2023-04-22'), 'grapefruit'),
        (2, DATE('2024-12-15'), 'strawberry'),
        (2, DATE('2024-01-11'), 'blueberry'),
        (2, DATE('2024-02-12'), 'mango'),
        (2, DATE('2024-02-29'), 'watermelon'),
        (2, DATE('2024-03-10'), 'pear')
      ) AS t(person_id, breakfast_date, fruit_eaten_for_breakfast)
    ),
    
    table_b AS (
      SELECT * FROM (VALUES
        (1, DATE('2023-03-15'), 1),
        (1, DATE('2023-03-30'), 1),
        (1, DATE('2023-04-02'), 2),
        (1, DATE('2023-04-10'), 2),
        (1, DATE('2023-04-12'), 3),
        (1, DATE('2023-04-20'), 3),
        (2, DATE('2024-01-01'), 1),
        (2, DATE('2024-01-05'), 1),
        (2, DATE('2024-02-10'), 2),
        (2, DATE('2024-02-13'), 2)
      ) AS t(person_id, period_start_and_end, period)
    ), table_b_pivot as (
    SELECT person_id,MIN(period_start_and_end) from_date, MAX(period_start_and_end) as To_date, period FROM table_b
    GROUP BY person_id, period)
    SELECT table_a.person_id, breakfast_date, fruit_eaten_for_breakfast, period
    FROM table_a JOIN table_b_pivot 
      ON table_a.breakfast_date BETWEEN table_b_pivot.from_date AND  table_b_pivot.To_date
    AND table_a.person_id = table_b_pivot.person_id
    ORDER BY table_a.person_id,period
    
    person_id breakfast_date fruit_eaten_for_breakfast period
    1 2023-03-25 apple 1
    1 2023-04-05 kiwi 2
    2 2024-02-12 mango 2