Search code examples
sqldatehivehiveqldate-range

Hive SQL query to fill missing date values in table with nearest values


I have spent days trying to figure out how to add missing dates with the nearest values in Hive with no luck. I need to use Hive SQL for this based on environment constraints. The raw table currently looks like the below table.

account name,available balance,Date of balance 

Peter,50000,2021-05-24
Peter,50035,2021-05-25
Peter,50035,2021-05-26
Peter,50610,2021-05-28
Peter,51710,2021-06-01
Peter,53028.1,2021-06-02
Peter,53916.1,2021-06-03
Mary,50000,2021-05-24
Mary,50035,2021-05-25
Mary,53028.1,2021-05-30

Raw balance table

What I need is to convert the table above to the table in below link:

account name,available balance,Date of balance 

Peter,50000,2021-05-24
Peter,50035,2021-05-25
Peter,50035,2021-05-26
Peter,50035,2021-05-27
Peter,50610,2021-05-28
Peter,50610,2021-05-29
Peter,50610,2021-05-30
Mary,50000,2021-05-24
Mary,50035,2021-05-25
Mary,50035,2021-05-26
Mary,50035,2021-05-27
Mary,50035,2021-05-28
Mary,50035,2021-05-29
Mary,53028.1,2021-05-30

Converted table

Can anyone please share the Hive SQL logic to make this change please?


Solution

  • Get next date using lead() function, calculate difference in days, get string of spaces with length = diff in days, split, use posexplode to generate rows, use position to add to date to get missing dates:

    with mytable as (--Demo dataset, use your table instead of this
    select stack(10, --number of tuples
    'Peter',float(50000),'2021-05-24',
    'Peter',float(50035),'2021-05-25',
    'Peter',float(50035),'2021-05-26',
    'Peter',float(50610),'2021-05-28',
    'Peter',float(51710),'2021-06-01',
    'Peter',float(53028.1),'2021-06-02',
    'Peter',float(53916.1),'2021-06-03',
    'Mary',float(50000),'2021-05-24',
    'Mary',float(50035),'2021-05-25',
    'Mary',float(53028.1),'2021-05-30'
    ) as (account_name,available_balance,Date_of_balance)
    ) --use your table instead of this CTE
    
    select  account_name, available_balance, date_add(Date_of_balance,e.i) as Date_of_balance
    from
    ( --Get next_date to generate date range
    select account_name,available_balance,Date_of_balance,
           lead(Date_of_balance,1, Date_of_balance) over (partition by account_name order by Date_of_balance) next_date    
      from mytable d  --use your table
    ) s lateral view outer posexplode(split(space(datediff(next_date,Date_of_balance)-1),'')) e as i,x --generate rows
    order by account_name desc, Date_of_balance --this is to have order of rows like in your Converted Table
    

    Result:

    account_name    available_balance   date_of_balance 
    Peter           50000                2021-05-24
    Peter           50035                2021-05-25
    Peter           50035                2021-05-26
    Peter           50035                2021-05-27
    Peter           50610                2021-05-28
    Peter           50610                2021-05-29
    Peter           50610                2021-05-30
    Peter           50610                2021-05-31
    Peter           51710                2021-06-01
    Peter           53028.1              2021-06-02
    Peter           53916.1              2021-06-03
    Mary            50000                2021-05-24
    Mary            50035                2021-05-25
    Mary            50035                2021-05-26
    Mary            50035                2021-05-27
    Mary            50035                2021-05-28
    Mary            50035                2021-05-29
    Mary            53028.1              2021-05-30