sql inner-join snowflake-cloud-data-platform

Performing a JOIN with the closest-matching time

As a simplified scenario, suppose I have the following tables:

deal_views:

user_id | viewed_at           | more attributes ...
------------------------------
1       | 2020-07-12 15:00:00 | ...
1       | 2020-07-12 14:00:00 | ...
1       | 2020-07-12 13:00:00 | ...
1       | 2020-07-03 09:00:00 | ...
2       | 2020-07-12 15:00:00 | ...
2       | 2020-07-12 14:00:00 | ...
2       | 2020-07-12 13:00:00 | ...
3       | 2020-07-12 15:00:00 | ...

purchases:

user_id | purchased_at        | more attributes ...
------------------------------
1       | 2020-07-03 09:02:10 | ...
1       | 2020-07-12 14:04:53 | ...
2       | 2020-07-12 15:20:12 | ...

(Not every user who views the deals will make a purchase; some users will view deals multiple times before making a purchase; some users will make multiple purchases. Every user who made a purchase must have viewed the deals at least once; probably just before the purchase!)

What I want is a selection of deal_views (including some additional attributes; omitted above for simplicity) that occurred most recently before a purchase, for each user. The motivation here is that the visible deals change over time, so we'd like to know "what was the state of the deals, most recently before a purchase?".

So for the data above, this would be:

user_id | viewed_at           | more attributes ...
------------------------------
1       | 2020-07-03 09:00:00 | ...
1       | 2020-07-11 14:00:00 | ...
2       | 2020-07-11 15:00:00 | ...

SQL performance would be nice, but it's not a huge concern (so long as it's runnable!) as this is only a one-off query I wish to execute. There should be approximately 200,000 rows in the results table.

Solution

Snowflake supports lateral joins, so you can do:

select p.*, d.*
from purchases p left join lateral
     (select d.*
      from deals d
      where d.user_id = p.user_id and
            d.viewed_at < p.purchased_at
      order by d.viewed_at desc
      limit 1
     ) d
     on 1=1;

You can also row_number():

select pd.*
from (select p.*, d.*,   -- select the columns explicitly to avoid duplicate column names
            row_number() over (partition by p.user_id, p.purchased_at order by d.viewed_at desc) as seqnum
      from purchases p left join
           deals d
           on d.user_id = p.user_id and d.viewed_at < p.purchased_at 
     ) pd
where seqnum = 1;