Search code examples
sqldatehivehiveqldate-range

Hive SQL query to fill missing date values in table with nearest values between date range


For the below example, if i used the same data below and if I wanted both Mary and Peter accounts in the same date range, how would I ammend the hive sql query to go about doing this? For example set the date range between '2021-05-24' and '2021-06-03' and populate all balances during this period. If we take Mary as an example, I would also want to see Mary available balance of '53028.1' forward filled to the '2021-06-03' and also if Mary didn't have a value on the '2021-05-24' to have it back filled with balance of '50000'.

with mytable as (--Demo dataset, use your table instead of this
select stack(10, --number of tuples
'Peter',float(50000),'2021-05-24',
'Peter',float(50035),'2021-05-25',
'Peter',float(50035),'2021-05-26',
'Peter',float(50610),'2021-05-28',
'Peter',float(51710),'2021-06-01',
'Peter',float(53028.1),'2021-06-02',
'Peter',float(53916.1),'2021-06-03',
'Mary',float(50000),'2021-05-24',
'Mary',float(50035),'2021-05-25',
'Mary',float(53028.1),'2021-05-30'
) as (account_name,available_balance,Date_of_balance)
) --use your table instead of this CTE

select  account_name, available_balance, date_add(Date_of_balance,e.i) as Date_of_balance
from
( --Get next_date to generate date range
select account_name,available_balance,Date_of_balance,
       lead(Date_of_balance,1, Date_of_balance) over (partition by account_name order by Date_of_balance) next_date    
  from mytable d  --use your table
) s lateral view outer posexplode(split(space(datediff(next_date,Date_of_balance)-1),'')) e as i,x --generate rows
order by account_name desc, Date_of_balance --this is to have order of rows like in your Converted Table

Result:

account_name    available_balance   date_of_balance 
Peter           50000                2021-05-24
Peter           50035                2021-05-25
Peter           50035                2021-05-26
Peter           50035                2021-05-27
Peter           50610                2021-05-28
Peter           50610                2021-05-29
Peter           50610                2021-05-30
Peter           50610                2021-05-31
Peter           51710                2021-06-01
Peter           53028.1              2021-06-02
Peter           53916.1              2021-06-03
Mary            50000                2021-05-24
Mary            50035                2021-05-25
Mary            50035                2021-05-26
Mary            50035                2021-05-27
Mary            50035                2021-05-28
Mary            50035                2021-05-29
Mary            53028.1              2021-05-30

Note, this left join has helped me get this far in the attached link here


@Left Join

I have an extremely large table, whereby I require balances for the last 90 days on a daily basis. The number of accounts exceed 1 million accounts and the balance table is huge and balance records are only updated when an account balance changes. There are certain accounts that maybe would have not had a balance date record updated for more than a year, hence the below code proposed by -left join won’t really work.

I have two tables:

**Accounts lookup table:** 

account_name, observation_date
'Peter','2021-05-24'
'Luis','2021-03-21'

Balance table

account_name,account_balance,balance_date
'Peter',50000,'2020-03-20'
'Peter',50035,'2021-04-27'
'Peter',43821,'2021-05-21'
'Peter',50610,'2021-05-22'
'Mary',51710,'2019-03-20'
'Mary',53028.1,'2021-04-27'
'Mary',53916.1,'2021-05-21'
'Mary',54632.76,'2021-05-22'
'Roger',55147.76,'2021-03-03'
'Roger',55293.96,'2021-02-03'
'Roger',57142.15,'2021-03-04'
'Roger',67834.15,'2021-04-01'

The HIVE SQL query I'm looking for will be able to join these two tables and provide a result similar to below

Results Table

account_name,account_balance,balance_date
Peter,50000,2020-03-20
Peter,50000,2021-02-24
Peter,50000,2021-02-25
Peter,…,…
Peter,50035,2021-04-27
Peter,50035,2021-04-28
Peter,50035,2021-04-29
Peter,…,…
Peter,43821,2021-05-21
Peter,50610,2021-05-22
Peter,43821,2021-05-23
Peter,43821,2021-05-24
Roger,55147.76,05/01/2021
Roger,55147.76,06/01/2021
Roger,55147.76,07/01/2021
Roger,…,…
Roger,55293.96,2021-02-03
Roger,57142.15,2021-02-04
Roger,57142.15,2021-02-05
Roger,…,…
Roger,67834.15,2021-04-01
Roger,67834.15,2021-04-02
Roger,67834.15,2021-04-03
Roger,67834.15,2021-04-04
Roger,67834.15,2021-04-05

I know we can potentially grab all the balances since beginning of time and then do the lead function, however for a large scale environment, that won't work when querying in the millions daily.


Solution

  • Calculate additionally min and max dates for the whole dataset to determine date range required, also calculate min date per account to check if min date needs fixing. Then add additional step of calculation for both dates: check if it is boundary dates and if they are not as required, assign min and max values accordingly.

    In this example Peter start date is 2021-05-24 nad Mary starts with 2021-05-23, so, the range was extended and 2021-05-23 record generated for Peter. For Mary last date is 2021-05-30, missing rows generated at the end of the range.

    with mytable as (--Demo dataset, use your table instead of this
    select stack(10, --number of tuples
    'Peter',float(50000),'2021-05-24',
    'Peter',float(50035),'2021-05-25',
    'Peter',float(50035),'2021-05-26',
    'Peter',float(50610),'2021-05-28',
    'Peter',float(51710),'2021-06-01',
    'Peter',float(53028.1),'2021-06-02',
    'Peter',float(53916.1),'2021-06-03', -------------end date greater than Mary
    'Mary',float(50000),'2021-05-23', ----------------start date Less than Peter
    'Mary',float(50035),'2021-05-25',
    'Mary',float(53028.1),'2021-05-30'
    ) as (account_name,available_balance,Date_of_balance)
    ) --use your table instead of this CTE
    
    select  account_name, available_balance, date_add(Date_of_balance,e.i) as Date_of_balance
    from
    (select account_name, available_balance, 
            case when min_date < min_date_account and Date_of_balance = min_date_account then min_date 
                 else Date_of_balance 
             end Date_of_balance,
            
            case when (next_date is null) and (Date_of_balance = max_date) then Date_of_balance 
                 when (Date_of_balance < max_date) then nvl(next_date,date_add(max_date,1)) 
             end as next_date
    from
    ( --Get next_date to generate date range
    select account_name,available_balance,Date_of_balance,
           lead(Date_of_balance,1) over (partition by account_name order by Date_of_balance) next_date,
           max(Date_of_balance) over() max_date, --total min and max dates all accounts should align
           min(Date_of_balance) over() min_date, 
           min(Date_of_balance) over(partition by account_name) min_date_account
      from mytable d  --use your table
    ) s 
    ) s lateral view outer posexplode(split(space(datediff(next_date,Date_of_balance)-1),'')) e as i,x --generate rows
    order by account_name desc, Date_of_balance --this is to have order of rows like in your Converted Table
    

    Result:

    account_name    available_balance   date_of_balance
    Peter             50000              2021-05-23
    Peter             50000              2021-05-24
    Peter             50035              2021-05-25
    Peter             50035              2021-05-26
    Peter             50035              2021-05-27
    Peter             50610              2021-05-28
    Peter             50610              2021-05-29
    Peter             50610              2021-05-30
    Peter             50610              2021-05-31
    Peter             51710              2021-06-01
    Peter             53028.1            2021-06-02
    Peter             53916.1            2021-06-03
    Mary              50000              2021-05-23
    Mary              50000              2021-05-24
    Mary              50035              2021-05-25
    Mary              50035              2021-05-26
    Mary              50035              2021-05-27
    Mary              50035              2021-05-28
    Mary              50035              2021-05-29
    Mary              53028.1            2021-05-30
    Mary              53028.1            2021-05-31
    Mary              53028.1            2021-06-01
    Mary              53028.1            2021-06-02
    Mary              53028.1            2021-06-03
    

    Note that lead function calculated differently also, it does not have default value, NULL indicates the end date available