I need to calculate a row's value based on a calculated value from the previous row. I'm looking for some window function (or other solution), if there is one, that can help me do this. For context, we have the following data:
CREATE OR REPLACE TABLE example_data (
SALES_DATE DATE,
CURR_DATE_UNITS_PURCHASED NUMBER(10,1),
LTD_UNITS_PURCHASED NUMBER(10,1),
CURR_DATE_EXTRAP_NUM_SOLD NUMBER(10,1),
LTD_EXTRAP_NUM_SOLD NUMBER(10,1)
);
INSERT INTO example_data VALUES
('2023-11-01', 1000, 1000, 0, 0),
('2023-11-02', 0, 1000, 0, 0),
('2023-11-03', 0, 1000, 0, 0),
('2023-11-04', 200, 1200, 0, 0),
('2023-11-05', 0, 1200, 0, 0),
('2023-11-06', 0, 1200, 0, 0),
('2023-11-07', 50, 1250, 0, 0),
('2023-11-08', 0, 1250, 0, 0);
SALES_DATE | CURR_DATE_UNITS_PURCHASED | LTD_UNITS_PURCHASED | CURR_DATE_EXTRAP_NUM_SOLD | LTD_EXTRAP_NUM_SOLD |
---|---|---|---|---|
2023-11-01 | 1,000 | 1,000 | 0 | 0 |
2023-11-02 | 0 | 1,000 | 0 | 0 |
2023-11-03 | 0 | 1,000 | 0 | 0 |
2023-11-04 | 200 | 1,200 | 0 | 0 |
2023-11-05 | 0 | 1,200 | 0 | 0 |
2023-11-06 | 0 | 1,200 | 0 | 0 |
2023-11-07 | 50 | 1,250 | 0 | 0 |
2023-11-08 | 0 | 1,250 | 0 | 0 |
Each day we may purchase some units CURR_DATE_UNITS_PURCHASED
. The current total units purchased is LTD_UNITS_PURCHASED
.
Note:
LTD
stands for live-to-date, meaning a running total.
I'm doing some extrapolation and need to calculate the CURR_DATE_EXTRAP_NUM_SOLD
and LTD_EXTRAP_NUM_SOLD
fields (these are predicted units sold, based off below rules).
CURR_DATE_EXTRAP_NUM_SOLD
- Predicted units sold for day = 10% of remaining units.LTD_EXTRAP_NUM_SOLD
- This is running total of extrapolated units.LTD_UNITS_PURCHASED
- LTD_EXTRAP_NUM_SOLD
(previous day).The below example_data_exected
is the result I'm looking for (Note: this is the data that needs to be created, it is not available. I just created a table to make it easier for you to view).
CREATE OR REPLACE TABLE example_data_expected (
SALES_DATE DATE,
CURR_DATE_UNITS_PURCHASED NUMBER(10,1),
LTD_UNITS_PURCHASED NUMBER(10,1),
CURR_DATE_EXTRAP_NUM_SOLD NUMBER(10,1),
LTD_EXTRAP_NUM_SOLD NUMBER(10,1)
);
INSERT INTO example_data_expected VALUES
('2023-11-01', 1000, 1000, 100, 100), -- 1,000 left
('2023-11-02', 0, 1000, 90, 190), -- 900 left (1,000 - 100)
('2023-11-03', 0, 1000, 81, 271), -- 810 left (1,000 - 190)
('2023-11-04', 200, 1200, 92.9, 363.9), -- 929 left (1,200 - 271)
('2023-11-05', 0, 1200, 83.6, 447.5), -- 863.1 left (1,200 - 363.9)
('2023-11-06', 0, 1200, 75.3, 522.8), -- 752.5 left (1,200 - 447.5)
('2023-11-07', 50, 1250, 72.7, 595.5), -- 727.2 left (1,250 - 522.8)
('2023-11-08', 0, 1250, 65.5, 661.0); -- 654.5 left (1,250 - 595.5)
SALES_DATE | CURR_DATE_UNITS_PURCHASED | LTD_UNITS_PURCHASED | CURR_DATE_EXTRAP_NUM_SOLD | LTD_EXTRAP_NUM_SOLD |
---|---|---|---|---|
2023-11-01 | 1,000 | 1,000 | 100 | 100 |
2023-11-02 | 0 | 1,000 | 90 | 190 |
2023-11-03 | 0 | 1,000 | 81 | 271 |
2023-11-04 | 200 | 1,200 | 92.9 | 363.9 |
2023-11-05 | 0 | 1,200 | 83.6 | 447.5 |
2023-11-06 | 0 | 1,200 | 75.3 | 552.8 |
2023-11-07 | 50 | 1,250 | 72.7 | 595.5 |
2023-11-08 | 0 | 1,250 | 65.5 | 661.0 |
You can see on the first day, remaining units is whatever was purchased that first day - 1,000 - so 10% of that is 100, and the running extrap live-to-date (LTD) is 100.
On the second day, no units were purchased, so remaining units is 1,000 - 100 (previous day LTD_EXTRAP) = 900. So the extrapolated units for that day is 900 * 0.1 = 90. And the LTD_EXTRAP for today is 100 + 90 = 190.
And it keeps going like this.
I can't figure out any SQL to extrapolate these two fields (CURR_DATE_EXTRAP_NUM_SOLD
and LTD_EXTRAP_NUM_SOLD
) in a SELECT
. I tried using the LAG
window function, but this requires the previous number to be static (I think), or am I missing something about it. Or is there some way I can use an UPDATE
to update these fields?
Any help would be appreciated.
Note: I'm using Databricks SQL, which supports all ANSI-SQL, but also has it's own set of functions available. So a solution using Databricks SQL would also work.
Just for reference, this is what I tried with LAG
, and it didn't seem to work, as it is using the current EXTRAP_LTD (0.00).
SELECT
sales_date,
curr_date_units_purchased,
ltd_units_purchased,
(ltd_units_purchased - LAG(ltd_extrap_num_sold, 1, 0) OVER (ORDER BY sales_date)) AS remaining_units,
(remaining_units * 0.10) as curr_date_extrap_num_sold,
SUM(curr_date_extrap_num_sold) OVER (ORDER BY sales_date) AS ltd_extrap_num_sold
FROM example_data;
And this was the result
SALES_DATE | CURR_DATE_UNITS_PURCHASED | LTD_UNITS_PURCHASED | REMAINING_UNITS | CURR_DATE_EXTRAP_NUM_SOLD | LTD_EXTRAP_NUM_SOLD |
---|---|---|---|---|---|
2023-11-01 | 1,000 | 1,000 | 1,000 | 100 | 0 |
2023-11-02 | 0 | 1,000 | 1,000 | 100 | 0 |
2023-11-03 | 0 | 1,000 | 1,000 | 100 | 0 |
2023-11-04 | 200 | 1,200 | 1,200 | 120 | 0 |
2023-11-05 | 0 | 1,200 | 1,200 | 120 | 0 |
2023-11-06 | 0 | 1,200 | 1,200 | 120 | 0 |
2023-11-07 | 50 | 1,250 | 1,250 | 125 | 0 |
2023-11-08 | 0 | 1,250 | 1,250 | 125 | 0 |
Pls, see another solution.
with params as(select cast(0.9 as decimal(38,10)) x)
,cte1 as(
select t.*
,row_number()over(order by sales_date) k
,sum(CURR_DATE_UNITS_PURCHASED)over(order by sales_date) roll_sum_purchased
,params.x
from example_data_expected t
cross apply params
)
,cte2 as(
select *
,roll_sum_purchased
- (sum(CURR_DATE_UNITS_PURCHASED*power(x,1-k))over(order by sales_date))
*power(x,k) extrap_sum
from cte1
)
select SALES_DATE,k,CURR_DATE_UNITS_PURCHASED,roll_sum_purchased
,extrap_sum-lag(extrap_sum,1,0)over(order by SALES_DATE) curr_day_extrap
,CURR_DATE_EXTRAP_NUM_SOLD
,extrap_sum
,LTD_EXTRAP_NUM_SOLD
,LTD_EXTRAP_NUM_SOLD - extrap_sum check_extrap_sum
from cte2
Parameter x=0.9
can be passed in any other way or written everywhere in the request as a constant.
Manipulations with calculation of power(x,k)
can also be improved.
I add some data to test table
INSERT INTO example_data_expected VALUES
('2023-11-09', 0, 1250, 58.906521, 719.841311)
,('2023-11-10', 0, 1250, 53.0158689, 772.8571799)
,('2023-11-11', 0, 1250, 47.71428201, 820.5714619)
It is advisable to additionally check this values. I'll check with @Saikat query. The discrepancies are satisfactory.
Draft result of query with your and additional data
SALES_DATE | k | roll_sum_purchased | curr_day_extrap | CURR_DATE_EXTRAP_NUM_SOLD | extrap_sum | LTD_EXTRAP_NUM_SOLD | check_extrap_sum | CURR_DATE_UNITS_PURCHASED |
---|---|---|---|---|---|---|---|---|
2023-11-01 | 1 | 1000.0000000000 | 100.000000 | 100.0000000000 | 100.000000 | 100.0000000000 | 0.000000 | 1000.0000000000 |
2023-11-02 | 2 | 1000.0000000000 | 90.000000 | 90.0000000000 | 190.000000 | 190.0000000000 | 0.000000 | 0.0000000000 |
2023-11-03 | 3 | 1000.0000000000 | 81.000000 | 81.0000000000 | 271.000000 | 271.0000000000 | 0.000000 | 0.0000000000 |
2023-11-04 | 4 | 1200.0000000000 | 92.900000 | 92.9000000000 | 363.900000 | 363.9000000000 | 0.000000 | 200.0000000000 |
2023-11-05 | 5 | 1200.0000000000 | 83.610000 | 83.6000000000 | 447.510000 | 447.5000000000 | -0.010000 | 0.0000000000 |
2023-11-06 | 6 | 1200.0000000000 | 75.249000 | 75.3000000000 | 522.759000 | 522.8000000000 | 0.041000 | 0.0000000000 |
2023-11-07 | 7 | 1250.0000000000 | 72.724100 | 72.7000000000 | 595.483100 | 595.5000000000 | 0.016900 | 50.0000000000 |
2023-11-08 | 8 | 1250.0000000000 | 65.451690 | 65.5000000000 | 660.934790 | 661.0000000000 | 0.065210 | 0.0000000000 |
2023-11-09 | 9 | 1250.0000000000 | 58.906521 | 58.9065210000 | 719.841311 | 719.8413110000 | 0.000000 | 0.0000000000 |
2023-11-10 | 10 | 1250.0000000000 | 53.015869 | 53.0158689000 | 772.857180 | 772.8571799000 | 0.000000 | 0.0000000000 |
2023-11-11 | 11 | 1250.0000000000 | 47.714282 | 47.7142820100 | 820.571462 | 820.5714619000 | 0.000000 | 0.0000000000 |
Upd1.
About using power().
For purchase 1000 units, extrapolated sales
(extrap)=1000*0.1=100 and remains =1000*(1-0.1)=1000*0.9=1000*power(0.9,1)=900.
For second day
extrap=900*0.1=90 and remains 900*(1-0.1)=900*0.9=1000*0.9*0.9=1000*power(0.9,2)=1000*0.81=810
For day k
remains
1000*power(0.9,k)
tod ->a1,a2,a3 - current day purchased values(CURR_DATE_UNITS_PURCHASED)
a1k=1,a2k=4,a3k=7
rest - remains from purchased
LTD_EXTRAP_NUM_SOLD - rolling sum =(rolling sum purcased)-rest
xk=power(x,k), where k - day number
rest from a1 for first day -> a1*power(x,1) ->1000*power(0.9,1)->1000*0.9=900
extrap for first day for a1=1000.0-900.0=100
rest from a1 for day num=k =a1*power(x,k) ->1000*power(0.9,k)
rest from a2 for day num=k =a2*power(x,k-4+1) ->200*power(0.9,k)
rest from a3 for day num=k =a3*power(x,k-7+1) ->50*power(0.9,k)
rest from sum of (a1+a2) for day number k=4 is
a1*power(x,k)+a2*power(x,1)
->a1*power(x,k)+a2*power(x,k)/power(x,1-k)
->(a1+a2/power(x,1-k))*power(x,k)
->((a1+a2*power(x,k-1))*power(x,k)
->((a1*power(x,a1k-1)+a2*power(x,a2k-1))*power(x,k)
Note that for a1 with k=1 a1*power(x,1-k)*power(x,k)
->a1*power(x,0)*power(x,1)->a1*1.0*0.9
For row with k=7
((a1*power(x,a1k-1)+a2*power(x,a2k-1)+a3*power(x,a3k-1))*power(x,k)
Sou: sum(tod*power(x,todK))over(order by ...)*power(x,k)
SALES_DATE | k | xk | tod | rest from a1 | rest from a2 | rest from a3 | rest_sum |
---|---|---|---|---|---|---|---|
01.11.2023 | 1 | 0,9 | a1=1000 | a1*xk=900 | 0 | 0 | 900 |
02.11.2023 | 2 | 0,81 | 0 | 810 | 0 | 0 | 810 |
03.11.2023 | 3 | 0,729 | 0 | 729 | 0 | 0 | 729 |
04.11.2023 | 4 | 0,6561 | a2=200 | 656,1 | 180 | 0 | 836,1 |
05.11.2023 | 5 | 0,59049 | 0 | 590,49 | 162 | 0 | 752,49 |
06.11.2023 | 6 | 0,531441 | 0 | 531,441 | 145,8 | 0 | 677,241 |
07.11.2023 | 7 | 0,4782969 | a3=50 | 478,2969 | 131,22 | 45 | 654,5169 |
08.11.2023 | 8 | 0,43046721 | 0 | 430,46721 | 118,098 | 40,5 | 589,06521 |
09.11.2023 | 9 | 0,387420489 | 0 | 387,420489 | 106,2882 | 36,45 | 530,158689 |
10.11.2023 | 10 | 0,34867844 | 0 | 348,6784401 | 95,65938 | 32,805 | 477,14282 |
11.11.2023 | 11 | 0,313810596 | 0 | 313,8105961 | 86,093442 | 29,5245 | 429,428538 |
Note that to use this method, you need to study the effect of the length of the row sequence for calculation. As the value of k
increases, the used part of the multiplier mantissa decreases. For example, power(0.9,80)=0.000218474500528393.
This can reduce the accuracy of calculations (if you need accuracy).
For example for purchase of 1000 units, after 80 day remains 0.218474500528393 units and extrap is 0,0218474500528393 units.
It just seems that you don't need to use all the amounts sum(...)over (order by date of sale)
- all rows before. you can sum(...)over(order by sales_date rows ... nearest 100)
. In this case, same changes in calculations must be done.
Next variant. There should be no problems with the accuracy of calculations.
with cte0 as(
select t.*
,row_number()over(order by sales_date) k
,cast(0.9 as decimal(38,20)) x
from example_data_expected t
)
,cte1 as(
select t.*
,cast(CURR_DATE_UNITS_PURCHASED*power(x,1-k) as decimal(38,20)) dayP1_K
,cast(k*log(x) as decimal(38,20)) kxL
from cte0 t
)
,cte2 as(
select *
,sum(CURR_DATE_UNITS_PURCHASED)over(order by sales_date) -- roll_sum_purchased
-cast(exp(log((sum(dayP1_K)over(order by sales_date)))+kxL) as decimal(38,20)) extrap_sum
from cte1
)
,cte3 as(
select *
,extrap_sum-lag(extrap_sum,1,0)over(order by SALES_DATE) curr_day_extrap
from cte2
)
There used equality
x=exp(log(x))
log(x*y)=log(x)+log(y)
log(power(x,k))=k*log(x)