For the below example, if i used the same data below and if I wanted both Mary and Peter accounts in the same date range, how would I ammend the hive sql query to go about doing this? For example set the date range between '2021-05-24' and '2021-06-03' and populate all balances during this period. If we take Mary as an example, I would also want to see Mary available balance of '53028.1' forward filled to the '2021-06-03' and also if Mary didn't have a value on the '2021-05-24' to have it back filled with balance of '50000'.
with mytable as (--Demo dataset, use your table instead of this
select stack(10, --number of tuples
'Peter',float(50000),'2021-05-24',
'Peter',float(50035),'2021-05-25',
'Peter',float(50035),'2021-05-26',
'Peter',float(50610),'2021-05-28',
'Peter',float(51710),'2021-06-01',
'Peter',float(53028.1),'2021-06-02',
'Peter',float(53916.1),'2021-06-03',
'Mary',float(50000),'2021-05-24',
'Mary',float(50035),'2021-05-25',
'Mary',float(53028.1),'2021-05-30'
) as (account_name,available_balance,Date_of_balance)
) --use your table instead of this CTE
select account_name, available_balance, date_add(Date_of_balance,e.i) as Date_of_balance
from
( --Get next_date to generate date range
select account_name,available_balance,Date_of_balance,
lead(Date_of_balance,1, Date_of_balance) over (partition by account_name order by Date_of_balance) next_date
from mytable d --use your table
) s lateral view outer posexplode(split(space(datediff(next_date,Date_of_balance)-1),'')) e as i,x --generate rows
order by account_name desc, Date_of_balance --this is to have order of rows like in your Converted Table
Result:
account_name available_balance date_of_balance
Peter 50000 2021-05-24
Peter 50035 2021-05-25
Peter 50035 2021-05-26
Peter 50035 2021-05-27
Peter 50610 2021-05-28
Peter 50610 2021-05-29
Peter 50610 2021-05-30
Peter 50610 2021-05-31
Peter 51710 2021-06-01
Peter 53028.1 2021-06-02
Peter 53916.1 2021-06-03
Mary 50000 2021-05-24
Mary 50035 2021-05-25
Mary 50035 2021-05-26
Mary 50035 2021-05-27
Mary 50035 2021-05-28
Mary 50035 2021-05-29
Mary 53028.1 2021-05-30
Note, this left join has helped me get this far in the attached link here
I have an extremely large table, whereby I require balances for the last 90 days on a daily basis. The number of accounts exceed 1 million accounts and the balance table is huge and balance records are only updated when an account balance changes. There are certain accounts that maybe would have not had a balance date record updated for more than a year, hence the below code proposed by -left join won’t really work.
I have two tables:
**Accounts lookup table:**
account_name, observation_date
'Peter','2021-05-24'
'Luis','2021-03-21'
Balance table
account_name,account_balance,balance_date
'Peter',50000,'2020-03-20'
'Peter',50035,'2021-04-27'
'Peter',43821,'2021-05-21'
'Peter',50610,'2021-05-22'
'Mary',51710,'2019-03-20'
'Mary',53028.1,'2021-04-27'
'Mary',53916.1,'2021-05-21'
'Mary',54632.76,'2021-05-22'
'Roger',55147.76,'2021-03-03'
'Roger',55293.96,'2021-02-03'
'Roger',57142.15,'2021-03-04'
'Roger',67834.15,'2021-04-01'
The HIVE SQL query I'm looking for will be able to join these two tables and provide a result similar to below
account_name,account_balance,balance_date
Peter,50000,2020-03-20
Peter,50000,2021-02-24
Peter,50000,2021-02-25
Peter,…,…
Peter,50035,2021-04-27
Peter,50035,2021-04-28
Peter,50035,2021-04-29
Peter,…,…
Peter,43821,2021-05-21
Peter,50610,2021-05-22
Peter,43821,2021-05-23
Peter,43821,2021-05-24
Roger,55147.76,05/01/2021
Roger,55147.76,06/01/2021
Roger,55147.76,07/01/2021
Roger,…,…
Roger,55293.96,2021-02-03
Roger,57142.15,2021-02-04
Roger,57142.15,2021-02-05
Roger,…,…
Roger,67834.15,2021-04-01
Roger,67834.15,2021-04-02
Roger,67834.15,2021-04-03
Roger,67834.15,2021-04-04
Roger,67834.15,2021-04-05
I know we can potentially grab all the balances since beginning of time and then do the lead function, however for a large scale environment, that won't work when querying in the millions daily.
Calculate additionally min and max dates for the whole dataset to determine date range required, also calculate min date per account to check if min date needs fixing. Then add additional step of calculation for both dates: check if it is boundary dates and if they are not as required, assign min and max values accordingly.
In this example Peter start date is 2021-05-24 nad Mary starts with 2021-05-23, so, the range was extended and 2021-05-23 record generated for Peter. For Mary last date is 2021-05-30, missing rows generated at the end of the range.
with mytable as (--Demo dataset, use your table instead of this
select stack(10, --number of tuples
'Peter',float(50000),'2021-05-24',
'Peter',float(50035),'2021-05-25',
'Peter',float(50035),'2021-05-26',
'Peter',float(50610),'2021-05-28',
'Peter',float(51710),'2021-06-01',
'Peter',float(53028.1),'2021-06-02',
'Peter',float(53916.1),'2021-06-03', -------------end date greater than Mary
'Mary',float(50000),'2021-05-23', ----------------start date Less than Peter
'Mary',float(50035),'2021-05-25',
'Mary',float(53028.1),'2021-05-30'
) as (account_name,available_balance,Date_of_balance)
) --use your table instead of this CTE
select account_name, available_balance, date_add(Date_of_balance,e.i) as Date_of_balance
from
(select account_name, available_balance,
case when min_date < min_date_account and Date_of_balance = min_date_account then min_date
else Date_of_balance
end Date_of_balance,
case when (next_date is null) and (Date_of_balance = max_date) then Date_of_balance
when (Date_of_balance < max_date) then nvl(next_date,date_add(max_date,1))
end as next_date
from
( --Get next_date to generate date range
select account_name,available_balance,Date_of_balance,
lead(Date_of_balance,1) over (partition by account_name order by Date_of_balance) next_date,
max(Date_of_balance) over() max_date, --total min and max dates all accounts should align
min(Date_of_balance) over() min_date,
min(Date_of_balance) over(partition by account_name) min_date_account
from mytable d --use your table
) s
) s lateral view outer posexplode(split(space(datediff(next_date,Date_of_balance)-1),'')) e as i,x --generate rows
order by account_name desc, Date_of_balance --this is to have order of rows like in your Converted Table
Result:
account_name available_balance date_of_balance
Peter 50000 2021-05-23
Peter 50000 2021-05-24
Peter 50035 2021-05-25
Peter 50035 2021-05-26
Peter 50035 2021-05-27
Peter 50610 2021-05-28
Peter 50610 2021-05-29
Peter 50610 2021-05-30
Peter 50610 2021-05-31
Peter 51710 2021-06-01
Peter 53028.1 2021-06-02
Peter 53916.1 2021-06-03
Mary 50000 2021-05-23
Mary 50000 2021-05-24
Mary 50035 2021-05-25
Mary 50035 2021-05-26
Mary 50035 2021-05-27
Mary 50035 2021-05-28
Mary 50035 2021-05-29
Mary 53028.1 2021-05-30
Mary 53028.1 2021-05-31
Mary 53028.1 2021-06-01
Mary 53028.1 2021-06-02
Mary 53028.1 2021-06-03
Note that lead
function calculated differently also, it does not have default value, NULL indicates the end date available