I am trying to find the previous pages visited by a person from web table. I am using lag function to find previous pages that the person has visited only if there was a next page visited.
web table:
id | visit_time | webpage_visited |
---|---|---|
1 | 2024-03-14 10:00:01 | google.com |
1 | 2024-03-14 10:00:07 | |
1 | 2024-03-14 10:01:15 | |
1 | 2024-03-14 10:01:10 | espn.com |
1 | 2024-03-14 10:02:01 |
When I use below SQL, the values returned are not considering the null/blank values between the rows.
SQL used:
select id,
visit_time,
webpage_visited,
coalesce(lag(webpage_visited, 1) over (partition by id order by visit_time asc), 'none') as previous_webpage_visited
from web
Please advice how I can achieve the below expected output.
Expected output:
id | visit_time | webpage_visited | previous_webpage_visited |
---|---|---|---|
1 | 2024-03-14 10:00:01 | google.com | None |
1 | 2024-03-14 10:00:07 | ||
1 | 2024-03-14 10:01:15 | ||
1 | 2024-03-14 10:01:10 | espn.com | google.com |
1 | 2024-03-14 10:02:01 |
The trick consists of two parts - use IGNORE NULLS
for your window function and apply the window function only if the current value is not null:
sample data
WITH dataset(id, visit_time, webpage_visited) AS (
values (1,'2024-03-14 10:00:01','google.com'),
(1,'2024-03-14 10:00:07',NULL),
(1,'2024-03-14 10:01:15',NULL),
(1,'2024-03-14 10:01:10','espn.com'),
(1,'2024-03-14 10:02:01',NULL)
)
-- query
select id,
visit_time,
webpage_visited,
if(webpage_visited is not null,
coalesce(lag(webpage_visited, 1) IGNORE NULLS over(partition by id order by visit_time), 'none'))
as previous_webpage_visited
from dataset
order by visit_time;
Output:
id | visit_time | webpage_visited | previous_webpage_visited |
---|---|---|---|
1 | 2024-03-14 10:00:01 | google.com | none |
1 | 2024-03-14 10:00:07 | NULL | NULL |
1 | 2024-03-14 10:01:10 | espn.com | google.com |
1 | 2024-03-14 10:01:15 | NULL | NULL |
1 | 2024-03-14 10:02:01 | NULL | NULL |
UPD
To handle empty/whitespace entries you can preprocess them, for example with subquery:
select id,
visit_time,
webpage_visited,
if(webpage_visited_sanitized is not null,
coalesce(lag(webpage_visited_sanitized, 1) IGNORE NULLS over(partition by id order by visit_time), 'none'))
as previous_webpage_visited
from (select *,
if(trim(webpage_visited) = '', null, webpage_visited) webpage_visited_sanitized
from dataset
);