Search code examples
sqlapache-spark-sqldatabrickswindow-functions

Is there a SQL window function to use a calculated value from a previous row?


I need to calculate a row's value based on a calculated value from the previous row. I'm looking for some window function (or other solution), if there is one, that can help me do this. For context, we have the following data:

CREATE OR REPLACE TABLE example_data (
    SALES_DATE DATE,
    CURR_DATE_UNITS_PURCHASED NUMBER(10,1),
    LTD_UNITS_PURCHASED NUMBER(10,1),
    CURR_DATE_EXTRAP_NUM_SOLD NUMBER(10,1),
    LTD_EXTRAP_NUM_SOLD NUMBER(10,1)
);
INSERT INTO example_data VALUES
('2023-11-01', 1000, 1000, 0, 0),
('2023-11-02', 0,    1000, 0, 0),
('2023-11-03', 0,    1000, 0, 0),
('2023-11-04', 200,  1200, 0, 0),
('2023-11-05', 0,    1200, 0, 0),
('2023-11-06', 0,    1200, 0, 0),
('2023-11-07', 50,   1250, 0, 0),
('2023-11-08', 0,    1250, 0, 0);
SALES_DATE CURR_DATE_UNITS_PURCHASED LTD_UNITS_PURCHASED CURR_DATE_EXTRAP_NUM_SOLD LTD_EXTRAP_NUM_SOLD
2023-11-01 1,000 1,000 0 0
2023-11-02 0 1,000 0 0
2023-11-03 0 1,000 0 0
2023-11-04 200 1,200 0 0
2023-11-05 0 1,200 0 0
2023-11-06 0 1,200 0 0
2023-11-07 50 1,250 0 0
2023-11-08 0 1,250 0 0

Each day we may purchase some units CURR_DATE_UNITS_PURCHASED. The current total units purchased is LTD_UNITS_PURCHASED.

Note: LTD stands for live-to-date, meaning a running total.

I'm doing some extrapolation and need to calculate the CURR_DATE_EXTRAP_NUM_SOLD and LTD_EXTRAP_NUM_SOLD fields (these are predicted units sold, based off below rules).

  • CURR_DATE_EXTRAP_NUM_SOLD - Predicted units sold for day = 10% of remaining units.
  • LTD_EXTRAP_NUM_SOLD - This is running total of extrapolated units.
  • Remaining units = LTD_UNITS_PURCHASED - LTD_EXTRAP_NUM_SOLD (previous day).

The below example_data_exected is the result I'm looking for (Note: this is the data that needs to be created, it is not available. I just created a table to make it easier for you to view).

CREATE OR REPLACE TABLE example_data_expected (
    SALES_DATE DATE,
    CURR_DATE_UNITS_PURCHASED NUMBER(10,1),
    LTD_UNITS_PURCHASED NUMBER(10,1),
    CURR_DATE_EXTRAP_NUM_SOLD NUMBER(10,1),
    LTD_EXTRAP_NUM_SOLD NUMBER(10,1)
);
INSERT INTO example_data_expected VALUES
('2023-11-01', 1000, 1000, 100,  100),   -- 1,000 left
('2023-11-02', 0,    1000, 90,   190),   -- 900 left (1,000 - 100)
('2023-11-03', 0,    1000, 81,   271),   -- 810 left (1,000 - 190)
('2023-11-04', 200,  1200, 92.9, 363.9), -- 929 left (1,200 - 271)
('2023-11-05', 0,    1200, 83.6, 447.5), -- 863.1 left (1,200 - 363.9)
('2023-11-06', 0,    1200, 75.3, 522.8), -- 752.5 left (1,200 - 447.5)
('2023-11-07', 50,   1250, 72.7, 595.5), -- 727.2 left (1,250 - 522.8)
('2023-11-08', 0,    1250, 65.5, 661.0); -- 654.5 left (1,250 - 595.5)
SALES_DATE CURR_DATE_UNITS_PURCHASED LTD_UNITS_PURCHASED CURR_DATE_EXTRAP_NUM_SOLD LTD_EXTRAP_NUM_SOLD
2023-11-01 1,000 1,000 100 100
2023-11-02 0 1,000 90 190
2023-11-03 0 1,000 81 271
2023-11-04 200 1,200 92.9 363.9
2023-11-05 0 1,200 83.6 447.5
2023-11-06 0 1,200 75.3 552.8
2023-11-07 50 1,250 72.7 595.5
2023-11-08 0 1,250 65.5 661.0

You can see on the first day, remaining units is whatever was purchased that first day - 1,000 - so 10% of that is 100, and the running extrap live-to-date (LTD) is 100.

On the second day, no units were purchased, so remaining units is 1,000 - 100 (previous day LTD_EXTRAP) = 900. So the extrapolated units for that day is 900 * 0.1 = 90. And the LTD_EXTRAP for today is 100 + 90 = 190.

And it keeps going like this.

I can't figure out any SQL to extrapolate these two fields (CURR_DATE_EXTRAP_NUM_SOLD and LTD_EXTRAP_NUM_SOLD) in a SELECT. I tried using the LAG window function, but this requires the previous number to be static (I think), or am I missing something about it. Or is there some way I can use an UPDATE to update these fields?

Any help would be appreciated.

Note: I'm using Databricks SQL, which supports all ANSI-SQL, but also has it's own set of functions available. So a solution using Databricks SQL would also work.

UPDATE

Just for reference, this is what I tried with LAG, and it didn't seem to work, as it is using the current EXTRAP_LTD (0.00).

SELECT
    sales_date,
    curr_date_units_purchased,
    ltd_units_purchased,
    (ltd_units_purchased - LAG(ltd_extrap_num_sold, 1, 0) OVER (ORDER BY sales_date)) AS remaining_units,
    (remaining_units * 0.10) as curr_date_extrap_num_sold,
    SUM(curr_date_extrap_num_sold) OVER (ORDER BY sales_date) AS ltd_extrap_num_sold
FROM example_data;

And this was the result

SALES_DATE CURR_DATE_UNITS_PURCHASED LTD_UNITS_PURCHASED REMAINING_UNITS CURR_DATE_EXTRAP_NUM_SOLD LTD_EXTRAP_NUM_SOLD
2023-11-01 1,000 1,000 1,000 100 0
2023-11-02 0 1,000 1,000 100 0
2023-11-03 0 1,000 1,000 100 0
2023-11-04 200 1,200 1,200 120 0
2023-11-05 0 1,200 1,200 120 0
2023-11-06 0 1,200 1,200 120 0
2023-11-07 50 1,250 1,250 125 0
2023-11-08 0 1,250 1,250 125 0

Solution

  • Pls, see another solution.

    with params as(select cast(0.9 as decimal(38,10)) x)
    ,cte1 as(
    select t.*
      ,row_number()over(order by sales_date) k
      ,sum(CURR_DATE_UNITS_PURCHASED)over(order by sales_date) roll_sum_purchased
      ,params.x
    from example_data_expected t
    cross apply params
    )
    ,cte2 as(
    select * 
      ,roll_sum_purchased 
       -  (sum(CURR_DATE_UNITS_PURCHASED*power(x,1-k))over(order by sales_date)) 
          *power(x,k) extrap_sum
    from cte1
    )
    select SALES_DATE,k,CURR_DATE_UNITS_PURCHASED,roll_sum_purchased 
      ,extrap_sum-lag(extrap_sum,1,0)over(order by SALES_DATE) curr_day_extrap
      ,CURR_DATE_EXTRAP_NUM_SOLD
      ,extrap_sum
      ,LTD_EXTRAP_NUM_SOLD
      ,LTD_EXTRAP_NUM_SOLD - extrap_sum check_extrap_sum
    from cte2
    

    Parameter x=0.9 can be passed in any other way or written everywhere in the request as a constant.
    Manipulations with calculation of power(x,k) can also be improved.

    I add some data to test table

    INSERT INTO example_data_expected VALUES
     ('2023-11-09', 0,    1250, 58.906521, 719.841311)
    ,('2023-11-10', 0,    1250, 53.0158689, 772.8571799)
    ,('2023-11-11', 0,    1250, 47.71428201, 820.5714619)
    

    It is advisable to additionally check this values. I'll check with @Saikat query. The discrepancies are satisfactory.

    Draft result of query with your and additional data

    SALES_DATE k roll_sum_purchased curr_day_extrap CURR_DATE_EXTRAP_NUM_SOLD extrap_sum LTD_EXTRAP_NUM_SOLD check_extrap_sum CURR_DATE_UNITS_PURCHASED
    2023-11-01 1 1000.0000000000 100.000000 100.0000000000 100.000000 100.0000000000 0.000000 1000.0000000000
    2023-11-02 2 1000.0000000000 90.000000 90.0000000000 190.000000 190.0000000000 0.000000 0.0000000000
    2023-11-03 3 1000.0000000000 81.000000 81.0000000000 271.000000 271.0000000000 0.000000 0.0000000000
    2023-11-04 4 1200.0000000000 92.900000 92.9000000000 363.900000 363.9000000000 0.000000 200.0000000000
    2023-11-05 5 1200.0000000000 83.610000 83.6000000000 447.510000 447.5000000000 -0.010000 0.0000000000
    2023-11-06 6 1200.0000000000 75.249000 75.3000000000 522.759000 522.8000000000 0.041000 0.0000000000
    2023-11-07 7 1250.0000000000 72.724100 72.7000000000 595.483100 595.5000000000 0.016900 50.0000000000
    2023-11-08 8 1250.0000000000 65.451690 65.5000000000 660.934790 661.0000000000 0.065210 0.0000000000
    2023-11-09 9 1250.0000000000 58.906521 58.9065210000 719.841311 719.8413110000 0.000000 0.0000000000
    2023-11-10 10 1250.0000000000 53.015869 53.0158689000 772.857180 772.8571799000 0.000000 0.0000000000
    2023-11-11 11 1250.0000000000 47.714282 47.7142820100 820.571462 820.5714619000 0.000000 0.0000000000

    Fiddle here

    Upd1.
    About using power(). For purchase 1000 units, extrapolated sales

    (extrap)=1000*0.1=100 and remains =1000*(1-0.1)=1000*0.9=1000*power(0.9,1)=900. 
    

    For second day

    extrap=900*0.1=90 and remains 900*(1-0.1)=900*0.9=1000*0.9*0.9=1000*power(0.9,2)=1000*0.81=810  
    

    For day k remains

    1000*power(0.9,k)  
    
    tod ->a1,a2,a3 - current day purchased values(CURR_DATE_UNITS_PURCHASED)  
    a1k=1,a2k=4,a3k=7
    rest - remains from purchased
    LTD_EXTRAP_NUM_SOLD - rolling sum =(rolling sum purcased)-rest  
    xk=power(x,k), where k - day number  
    
    rest from a1 for first day -> a1*power(x,1) ->1000*power(0.9,1)->1000*0.9=900  
    extrap for first day for a1=1000.0-900.0=100  
    rest from a1 for day num=k  =a1*power(x,k) ->1000*power(0.9,k)  
    rest from a2 for day num=k  =a2*power(x,k-4+1) ->200*power(0.9,k)  
    rest from a3 for day num=k  =a3*power(x,k-7+1) ->50*power(0.9,k)  
    rest from sum of (a1+a2)  for day number k=4 is  
    a1*power(x,k)+a2*power(x,1)  
    ->a1*power(x,k)+a2*power(x,k)/power(x,1-k)  
    ->(a1+a2/power(x,1-k))*power(x,k)
    ->((a1+a2*power(x,k-1))*power(x,k)  
    ->((a1*power(x,a1k-1)+a2*power(x,a2k-1))*power(x,k)  
    Note that for a1 with k=1 a1*power(x,1-k)*power(x,k)  
    ->a1*power(x,0)*power(x,1)->a1*1.0*0.9
    For row with k=7  
    ((a1*power(x,a1k-1)+a2*power(x,a2k-1)+a3*power(x,a3k-1))*power(x,k)  
    Sou: sum(tod*power(x,todK))over(order by ...)*power(x,k)
    
    
    SALES_DATE k xk tod rest from a1 rest from a2 rest from a3 rest_sum
    01.11.2023 1 0,9 a1=1000 a1*xk=900 0 0 900
    02.11.2023 2 0,81 0 810 0 0 810
    03.11.2023 3 0,729 0 729 0 0 729
    04.11.2023 4 0,6561 a2=200 656,1 180 0 836,1
    05.11.2023 5 0,59049 0 590,49 162 0 752,49
    06.11.2023 6 0,531441 0 531,441 145,8 0 677,241
    07.11.2023 7 0,4782969 a3=50 478,2969 131,22 45 654,5169
    08.11.2023 8 0,43046721 0 430,46721 118,098 40,5 589,06521
    09.11.2023 9 0,387420489 0 387,420489 106,2882 36,45 530,158689
    10.11.2023 10 0,34867844 0 348,6784401 95,65938 32,805 477,14282
    11.11.2023 11 0,313810596 0 313,8105961 86,093442 29,5245 429,428538

    Note that to use this method, you need to study the effect of the length of the row sequence for calculation. As the value of k increases, the used part of the multiplier mantissa decreases. For example, power(0.9,80)=0.000218474500528393.
    This can reduce the accuracy of calculations (if you need accuracy).
    For example for purchase of 1000 units, after 80 day remains 0.218474500528393 units and extrap is 0,0218474500528393 units.
    It just seems that you don't need to use all the amounts sum(...)over (order by date of sale) - all rows before. you can sum(...)over(order by sales_date rows ... nearest 100). In this case, same changes in calculations must be done.

    Next variant. There should be no problems with the accuracy of calculations.

    with cte0 as(
    select t.*
      ,row_number()over(order by sales_date) k
      ,cast(0.9 as decimal(38,20)) x
    from example_data_expected t
    )
    ,cte1 as(
    select t.*
      ,cast(CURR_DATE_UNITS_PURCHASED*power(x,1-k) as decimal(38,20)) dayP1_K
      ,cast(k*log(x) as decimal(38,20)) kxL
    from cte0 t
    )
    ,cte2 as(
    select * 
      ,sum(CURR_DATE_UNITS_PURCHASED)over(order by sales_date)  -- roll_sum_purchased
       -cast(exp(log((sum(dayP1_K)over(order by sales_date)))+kxL) as decimal(38,20)) extrap_sum
    from cte1
    )
    ,cte3 as(
    select *
      ,extrap_sum-lag(extrap_sum,1,0)over(order by SALES_DATE) curr_day_extrap
    from cte2
    )
    

    There used equality

    x=exp(log(x))
    log(x*y)=log(x)+log(y)
    log(power(x,k))=k*log(x)