Search code examples
sqlhivehiveqlwindow-functionsgaps-and-islands

Conditional LEAD/LAG with no sequence guarantee


How does one write a conditional lead/lag where the preceding or proceeding lead/lag isn't guaranteed to satisfy a certain condition? In my case, I'm looking at website traffic.

Sample data (prior_path and prior_event are the target fields of which I'm having trouble getting to prior_event given my conditions)

+-----------+-----+-------+---------------------------------------+-------+------------------------------------+-------------+
| sessionid | hit | type  |                 path                  | event |             prior_path             | prior_event |
+-----------+-----+-------+---------------------------------------+-------+------------------------------------+-------------+
|      1001 |   1 | event | www.stackoverflow.com                 | hover |                                    |             |
|      1001 |   2 | page  | www.stackoverflow.com                 |       |                                    | hover       |
|      1001 |   3 | event | www.stackoverflow.com                 | load  |                                    |             |
|      1001 |   4 | event | www.stackoverflow.com                 | blur  |                                    | load        |
|      1001 |   5 | event | www.stackoverflow.com                 | click |                                    | blur        |
|      1001 |   6 | page  | www.stackoverflow.com/post/10         |       | www.stackoverflow.com              | click       |
|      1001 |   7 | event | www.stackoverflow.com/post/10#details | offer |                                    |             |
|      1001 |   8 | page  | www.stackoverflow.com/post/confirm    |       | www.stackoverflow.com/post/10      | offer       |
|      1001 |   9 | page  | www.stackoverflow.com/questions/10    |       | www.stackoverflow.com/post/confirm | offer       |
|      1001 |  10 | event | www.stackoverflow.com/questions/10    | exit  |                                    |             |
+-----------+-----+-------+---------------------------------------+-------+------------------------------------+-------------+

prior_path: last path where type = page for only page hit types prior_event: last event where type = event for all hit types

Notice for hit 8 and 9, the "offer" event gets repeated because those led to those pages.

prior_path seems straight-forward as I can do

SELECT LAG(path) OVER (PARTITION BY sessionid, type ORDER BY hit) FROM my_table

but I'm not sure how to get prior_event.


Solution

  • You have the right expression for prior_path already. You just need to wrap it in a conditional expression.

    As for prior_event, it is a bit more complicated indeed. I would suggest the following approach:

    • for events, we can just use lag()

    • for pages, an option is to use some gaps-and-islands technique: first define groups with a conditional sum that increments every time an event is met, and then use first_value():

    This should do what you want:

    select  
        t.*,
        case when type = 'page'
            then lag(path) over(partition by sessionid, type  order by hit)
        end prior_path,
        case type 
            when 'page'
                then first_value(event) over(partition by sessionid, grp order by hit)
            when 'event' 
                then lag(event) over(partition by sessionid order by hit)
            end prior_event
    from (
        select 
            t.*,
            sum(case when type = 'event' then 1 else 0 end) 
                over(partition by sessionid order by hit) grp
        from mytable t
    ) t
    

    Demo on DB Fiddle (due to the lack of hive fiddle in the wild, I used Postgres - but this would work on Hive as well):

    sessionid | hit | type  | path                                  | event | grp | prior_path                         | prior_event
    --------: | --: | :---- | :------------------------------------ | :---- | --: | :--------------------------------- | :----------
         1001 |   1 | event | www.stackoverflow.com                 | hover |   1 | null                               | null       
         1001 |   2 | page  | www.stackoverflow.com                 | null  |   1 | null                               | hover      
         1001 |   3 | event | www.stackoverflow.com                 | load  |   2 | null                               | null       
         1001 |   4 | event | www.stackoverflow.com                 | blur  |   3 | null                               | load       
         1001 |   5 | event | www.stackoverflow.com                 | click |   4 | null                               | blur       
         1001 |   6 | page  | www.stackoverflow.com/post/10         | null  |   4 | www.stackoverflow.com              | click      
         1001 |   7 | event | www.stackoverflow.com/post/10#details | offer |   5 | null                               | null       
         1001 |   8 | page  | www.stackoverflow.com/post/confirm    | null  |   5 | www.stackoverflow.com/post/10      | offer      
         1001 |   9 | page  | www.stackoverflow.com/questions/10    | null  |   5 | www.stackoverflow.com/post/confirm | offer      
         1001 |  10 | event | www.stackoverflow.com/questions/10    | exit  |   6 | null                               | null