How does one write a conditional lead/lag where the preceding or proceeding lead/lag isn't guaranteed to satisfy a certain condition? In my case, I'm looking at website traffic.
Sample data (prior_path and prior_event are the target fields of which I'm having trouble getting to prior_event given my conditions)
+-----------+-----+-------+---------------------------------------+-------+------------------------------------+-------------+
| sessionid | hit | type | path | event | prior_path | prior_event |
+-----------+-----+-------+---------------------------------------+-------+------------------------------------+-------------+
| 1001 | 1 | event | www.stackoverflow.com | hover | | |
| 1001 | 2 | page | www.stackoverflow.com | | | hover |
| 1001 | 3 | event | www.stackoverflow.com | load | | |
| 1001 | 4 | event | www.stackoverflow.com | blur | | load |
| 1001 | 5 | event | www.stackoverflow.com | click | | blur |
| 1001 | 6 | page | www.stackoverflow.com/post/10 | | www.stackoverflow.com | click |
| 1001 | 7 | event | www.stackoverflow.com/post/10#details | offer | | |
| 1001 | 8 | page | www.stackoverflow.com/post/confirm | | www.stackoverflow.com/post/10 | offer |
| 1001 | 9 | page | www.stackoverflow.com/questions/10 | | www.stackoverflow.com/post/confirm | offer |
| 1001 | 10 | event | www.stackoverflow.com/questions/10 | exit | | |
+-----------+-----+-------+---------------------------------------+-------+------------------------------------+-------------+
prior_path: last path where type = page for only page hit types prior_event: last event where type = event for all hit types
Notice for hit 8 and 9, the "offer" event gets repeated because those led to those pages.
prior_path seems straight-forward as I can do
SELECT LAG(path) OVER (PARTITION BY sessionid, type ORDER BY hit) FROM my_table
but I'm not sure how to get prior_event.
You have the right expression for prior_path
already. You just need to wrap it in a conditional expression.
As for prior_event
, it is a bit more complicated indeed. I would suggest the following approach:
for events, we can just use lag()
for pages, an option is to use some gaps-and-islands technique: first define groups with a conditional sum that increments every time an event is met, and then use first_value()
:
This should do what you want:
select
t.*,
case when type = 'page'
then lag(path) over(partition by sessionid, type order by hit)
end prior_path,
case type
when 'page'
then first_value(event) over(partition by sessionid, grp order by hit)
when 'event'
then lag(event) over(partition by sessionid order by hit)
end prior_event
from (
select
t.*,
sum(case when type = 'event' then 1 else 0 end)
over(partition by sessionid order by hit) grp
from mytable t
) t
Demo on DB Fiddle (due to the lack of hive fiddle in the wild, I used Postgres - but this would work on Hive as well):
sessionid | hit | type | path | event | grp | prior_path | prior_event --------: | --: | :---- | :------------------------------------ | :---- | --: | :--------------------------------- | :---------- 1001 | 1 | event | www.stackoverflow.com | hover | 1 | null | null 1001 | 2 | page | www.stackoverflow.com | null | 1 | null | hover 1001 | 3 | event | www.stackoverflow.com | load | 2 | null | null 1001 | 4 | event | www.stackoverflow.com | blur | 3 | null | load 1001 | 5 | event | www.stackoverflow.com | click | 4 | null | blur 1001 | 6 | page | www.stackoverflow.com/post/10 | null | 4 | www.stackoverflow.com | click 1001 | 7 | event | www.stackoverflow.com/post/10#details | offer | 5 | null | null 1001 | 8 | page | www.stackoverflow.com/post/confirm | null | 5 | www.stackoverflow.com/post/10 | offer 1001 | 9 | page | www.stackoverflow.com/questions/10 | null | 5 | www.stackoverflow.com/post/confirm | offer 1001 | 10 | event | www.stackoverflow.com/questions/10 | exit | 6 | null | null