Search code examples
sqlapache-sparkhivedate-arithmetic

datediff in Hive resulting in Invalid number of arguments error


I am trying to find login start time, login end time and the total time spent on a webpage in seconds using lead and datediff functions in Hive.

Sample dataset:

id login_time
1 2023-05-03 00:20:37.000
1 2023-05-03 00:20:51.000
2 2023-05-03 15:42:31.000

Query used:

with temp as
(
    select id,
        login_time as start_time
    from login_table
)
select id, 
start_time, 
lead(start_time) over(partition by id) as end_time,
datediff('second', start_time, lead(start_time) over (partition by id)) as time_spent_on_page
from temp

But am getting error saying

Invalid number of arguments in datediff. Expected 2, found 3

Please advise how I can find the time difference between 2 values in seconds on Hive.

Expected output:

id start_time end_time time_spent_on_page
1 2023-05-03 00:20:37.000 2023-05-03 00:20:51.000 14
1 2023-05-03 00:20:51.000
2 2023-05-03 12:00:54.000 2023-05-03 12:01:09.000 15

Solution

  • Hive's datediff takes two date strings and return their difference in days.

    To get the difference between two datetime strings, we can use unix_timestamp twice:

    select id, 
        login_time as start_time,
        lead(login_time) over(partition by id order by login_time) as end_time,
        unix_timestamp(lead(login_time) over (partition by id order by login_time))
            - unix_timestamp(login_time) as time_spent_on_page
    from login_table
    

    Notes:

    • the with clause seems unnecessary, we can just select directly from the login table
    • lead() needs an order by clause

    Reference: Hive date functions