Search code examples
sqlinner-joinsnowflake-cloud-data-platform

Performing a JOIN with the closest-matching time


As a simplified scenario, suppose I have the following tables:

deal_views:

user_id | viewed_at           | more attributes ...
------------------------------
1       | 2020-07-12 15:00:00 | ...
1       | 2020-07-12 14:00:00 | ...
1       | 2020-07-12 13:00:00 | ...
1       | 2020-07-03 09:00:00 | ...
2       | 2020-07-12 15:00:00 | ...
2       | 2020-07-12 14:00:00 | ...
2       | 2020-07-12 13:00:00 | ...
3       | 2020-07-12 15:00:00 | ...

purchases:

user_id | purchased_at        | more attributes ...
------------------------------
1       | 2020-07-03 09:02:10 | ...
1       | 2020-07-12 14:04:53 | ...
2       | 2020-07-12 15:20:12 | ...

(Not every user who views the deals will make a purchase; some users will view deals multiple times before making a purchase; some users will make multiple purchases. Every user who made a purchase must have viewed the deals at least once; probably just before the purchase!)

What I want is a selection of deal_views (including some additional attributes; omitted above for simplicity) that occurred most recently before a purchase, for each user. The motivation here is that the visible deals change over time, so we'd like to know "what was the state of the deals, most recently before a purchase?".

So for the data above, this would be:

user_id | viewed_at           | more attributes ...
------------------------------
1       | 2020-07-03 09:00:00 | ...
1       | 2020-07-11 14:00:00 | ...
2       | 2020-07-11 15:00:00 | ...

SQL performance would be nice, but it's not a huge concern (so long as it's runnable!) as this is only a one-off query I wish to execute. There should be approximately 200,000 rows in the results table.


Solution

  • Snowflake supports lateral joins, so you can do:

    select p.*, d.*
    from purchases p left join lateral
         (select d.*
          from deals d
          where d.user_id = p.user_id and
                d.viewed_at < p.purchased_at
          order by d.viewed_at desc
          limit 1
         ) d
         on 1=1;
    

    You can also row_number():

    select pd.*
    from (select p.*, d.*,   -- select the columns explicitly to avoid duplicate column names
                row_number() over (partition by p.user_id, p.purchased_at order by d.viewed_at desc) as seqnum
          from purchases p left join
               deals d
               on d.user_id = p.user_id and d.viewed_at < p.purchased_at 
         ) pd
    where seqnum = 1;