As a simplified scenario, suppose I have the following tables:
deal_views:
user_id | viewed_at | more attributes ...
------------------------------
1 | 2020-07-12 15:00:00 | ...
1 | 2020-07-12 14:00:00 | ...
1 | 2020-07-12 13:00:00 | ...
1 | 2020-07-03 09:00:00 | ...
2 | 2020-07-12 15:00:00 | ...
2 | 2020-07-12 14:00:00 | ...
2 | 2020-07-12 13:00:00 | ...
3 | 2020-07-12 15:00:00 | ...
purchases:
user_id | purchased_at | more attributes ...
------------------------------
1 | 2020-07-03 09:02:10 | ...
1 | 2020-07-12 14:04:53 | ...
2 | 2020-07-12 15:20:12 | ...
(Not every user who views the deals will make a purchase; some users will view deals multiple times before making a purchase; some users will make multiple purchases. Every user who made a purchase must have viewed the deals at least once; probably just before the purchase!)
What I want is a selection of deal_views
(including some additional attributes; omitted above for simplicity) that occurred most recently before a purchase, for each user. The motivation here is that the visible deals change over time, so we'd like to know "what was the state of the deals, most recently before a purchase?".
So for the data above, this would be:
user_id | viewed_at | more attributes ...
------------------------------
1 | 2020-07-03 09:00:00 | ...
1 | 2020-07-11 14:00:00 | ...
2 | 2020-07-11 15:00:00 | ...
SQL performance would be nice, but it's not a huge concern (so long as it's runnable!) as this is only a one-off query I wish to execute. There should be approximately 200,000 rows in the results table.
Snowflake supports lateral joins, so you can do:
select p.*, d.*
from purchases p left join lateral
(select d.*
from deals d
where d.user_id = p.user_id and
d.viewed_at < p.purchased_at
order by d.viewed_at desc
limit 1
) d
on 1=1;
You can also row_number()
:
select pd.*
from (select p.*, d.*, -- select the columns explicitly to avoid duplicate column names
row_number() over (partition by p.user_id, p.purchased_at order by d.viewed_at desc) as seqnum
from purchases p left join
deals d
on d.user_id = p.user_id and d.viewed_at < p.purchased_at
) pd
where seqnum = 1;