Search code examples
sqljoinsnowflake-cloud-data-platform

Conditionally join rows in SQL based on a date value being in or not being in a start/end range


I have two tables:

Table A, names are unique

Name Color
Alex Red
Beck Yellow
Cora Blue

Table B, has addresses and a start/end date representing when the person lived at each address

Name Address Start End
Alex 41 Andover St 2023-01-15 2023-06-08
Alex 8 Lexington Dr 2023-06-08 2023-09-17
Alex 3 Sutor Ave 2023-09-17 NULL
Beck 349 Laurel Rd 2023-08-09 2023-09-23
Beck 155 Country Club Dr 2023-09-23 NULL
Cora 227 Henry Smith St 2022-12-19 NULL

I would like to join these two tables based on a date value passed to the query. If an address was active for that date value, use it. If no addresses are found, then use the earliest active address.

Example 1, date value passed to query is 2023-01-31:

Name Color Address
Alex Red 41 Andover St
Beck Yellow 349 Laurel Rd
Cora Blue 227 Henry Smith St

Example 2, date value passed to query is 2023-09-30:

Name Color Address
Alex Red 3 Sutor Ave
Beck Yellow 155 Country Club Dr
Cora Blue 227 Henry Smith St

So far I have only written queries that join records when the date value exists in the start/end ranges:

SELECT *
FROM tableA a
LEFT JOIN tableB b ON a.name=b.name
AND b.start <= '2024-01-31'
AND (b.end IS NULL OR b.end >= '2024-01-31')

I am not sure how to add the second part where it joins to the earliest active address if no records match. Is this possible using only SQL? I am querying in Snowflake if relevant.


Solution

  • So if you build a CTE that has the first per person, then you can LEFT JOIN this as well as the in date range. Then you can use NVL2 to pick the value based on the exact match success:

    with table_a(name, color) as (
        select * from values
            ('Alex', 'Red'),
            ('Beck', 'Yellow'),
            ('Cora', 'Blue')
    ), table_b(name, address, start_d, end_d) as (
        select * from values
            ('Alex', '41 Andover St', '2023-01-15', '2023-06-08'),
            ('Alex', '8 Lexington Dr', '2023-06-08', '2023-09-17'),
            ('Alex', '3 Sutor Ave', '2023-09-17', NULL),
            ('Beck', '349 Laurel Rd', '2023-08-09', '2023-09-23'),
            ('Beck', '155 Country Club Dr', '2023-09-23', NULL),
            ('Cora', '227 Henry Smith St', '2022-12-19', NULL)
    ), dates(date_d) as (
        select * from values
        ('2023-01-31'::date),
        ('2023-09-30'::date)
    ), first_per_person as (
        select * 
        from table_b
        qualify row_number() over (partition by name order by start_d) = 1
    )
    SELECT 
        d.*,
        a.*,
        nvl2(b.name, b.address, f.address) as address
        --nvl2(b.name, b.start_d, f.start_d) as start_d,
        --nvl2(b.name, b.end_d, f.end_d) as end_d
    FROM dates as d
    cross join table_a as a
    left join first_per_person as f 
        ON a.name=f.name    
    LEFT JOIN table_b as b 
        ON a.name=b.name
            AND b.start_d <= d.date_d
            AND (b.end_d IS NULL OR b.end_d >= d.date_d)
    order by 1,2
    

    Here I put both dates into another CTE, just so we could see it all working at the same time.

    If you wanted instead of "first" the most resent to the date used, ASOF JOIN is the optimal way to get that.

    This code also has the "date start/end" of the row that was used, to show how to use NVL2 to get other values, because the address line could use NVL/COALESCE but that will not work for the end date as that might valid as null, where the primary key name does not make sense to be null.

    enter image description here