I am trying to find login start time, login end time and the total time spent on a webpage in seconds using lead and datediff functions in Hive.
Sample dataset:
id | login_time |
---|---|
1 | 2023-05-03 00:20:37.000 |
1 | 2023-05-03 00:20:51.000 |
2 | 2023-05-03 15:42:31.000 |
Query used:
with temp as
(
select id,
login_time as start_time
from login_table
)
select id,
start_time,
lead(start_time) over(partition by id) as end_time,
datediff('second', start_time, lead(start_time) over (partition by id)) as time_spent_on_page
from temp
But am getting error saying
Invalid number of arguments in datediff. Expected 2, found 3
Please advise how I can find the time difference between 2 values in seconds on Hive.
Expected output:
id | start_time | end_time | time_spent_on_page |
---|---|---|---|
1 | 2023-05-03 00:20:37.000 | 2023-05-03 00:20:51.000 | 14 |
1 | 2023-05-03 00:20:51.000 | ||
2 | 2023-05-03 12:00:54.000 | 2023-05-03 12:01:09.000 | 15 |
Hive's datediff
takes two date strings and return their difference in days.
To get the difference between two datetime strings, we can use unix_timestamp
twice:
select id,
login_time as start_time,
lead(login_time) over(partition by id order by login_time) as end_time,
unix_timestamp(lead(login_time) over (partition by id order by login_time))
- unix_timestamp(login_time) as time_spent_on_page
from login_table
Notes:
with
clause seems unnecessary, we can just select directly from the login tablelead()
needs an order by
clauseReference: Hive date functions