Search code examples
sqlhiveprestotrino

LAG function to return previous rows by discarding null values


I am trying to find the previous pages visited by a person from web table. I am using lag function to find previous pages that the person has visited only if there was a next page visited.

web table:

id visit_time webpage_visited
1 2024-03-14 10:00:01 google.com
1 2024-03-14 10:00:07
1 2024-03-14 10:01:15
1 2024-03-14 10:01:10 espn.com
1 2024-03-14 10:02:01

When I use below SQL, the values returned are not considering the null/blank values between the rows.

SQL used:

select id, 
visit_time, 
webpage_visited, 
coalesce(lag(webpage_visited, 1) over (partition by id order by visit_time  asc), 'none') as previous_webpage_visited
from web

Please advice how I can achieve the below expected output.

Expected output:

id visit_time webpage_visited previous_webpage_visited
1 2024-03-14 10:00:01 google.com None
1 2024-03-14 10:00:07
1 2024-03-14 10:01:15
1 2024-03-14 10:01:10 espn.com google.com
1 2024-03-14 10:02:01

Solution

  • The trick consists of two parts - use IGNORE NULLS for your window function and apply the window function only if the current value is not null:

    sample data
    WITH dataset(id, visit_time, webpage_visited) AS (
       values (1,'2024-03-14 10:00:01','google.com'),
            (1,'2024-03-14 10:00:07',NULL),
            (1,'2024-03-14 10:01:15',NULL),
            (1,'2024-03-14 10:01:10','espn.com'),
            (1,'2024-03-14 10:02:01',NULL)
    )
    
    -- query
    select id,
           visit_time,
           webpage_visited,
           if(webpage_visited is not null,
               coalesce(lag(webpage_visited, 1) IGNORE NULLS over(partition by id order by visit_time), 'none'))
               as previous_webpage_visited
    from dataset
    order by visit_time;
    

    Output:

    id visit_time webpage_visited previous_webpage_visited
    1 2024-03-14 10:00:01 google.com none
    1 2024-03-14 10:00:07 NULL NULL
    1 2024-03-14 10:01:10 espn.com google.com
    1 2024-03-14 10:01:15 NULL NULL
    1 2024-03-14 10:02:01 NULL NULL

    UPD

    To handle empty/whitespace entries you can preprocess them, for example with subquery:

    select id,
           visit_time,
           webpage_visited,
           if(webpage_visited_sanitized is not null,
               coalesce(lag(webpage_visited_sanitized, 1) IGNORE NULLS over(partition by id order by visit_time), 'none'))
               as previous_webpage_visited
    from (select *,
                 if(trim(webpage_visited) = '', null, webpage_visited) webpage_visited_sanitized
          from dataset
         );