Search code examples
sqlsql-serveraggregate

Calculate the average value of the N cheapest items inside a GROUP BY


Using a table containing selling orders from a marketplace, defined by the items as an int, a quantity as an int, a price per units as an int and a time as a timestamp, I am trying to generate statistics on the orders.

The data look something like this:

Item_Id Qty Price Date
100 5 4 2000-01-01 00:00:00.000
100 5 2 2000-01-01 00:00:00.000
100 8 4 2000-01-01 00:00:00.000
100 89 1 2000-01-02 00:00:00.000
100 44 5 2000-01-02 00:00:00.000

I want to calculate the average value per items to obtain N items, grouped by items and for a given period
I want to get the lowest average value possible. The result expected should be something like that for item id 100, given N = 8 and a period of 1 day.

Item_Id Period Avrg_Value_Of_N_Items
100 2000-01-01 2.75 (5 units @ 2$ + 3 units @ 4$)
100 2000-01-02 1.00 (8 units @ 1$)

I made the following query to obtain the average value of the orders, grouped by items and period

SELECT [Item_Id]
    , AVG([Price]) as [AVG_Price]
    ,DATEADD(DAY, DATEDIFF(DAY, '2020', orders.[date]) / 1 * 1, '2020') as [Date]
FROM [orders] as orders
GROUP BY [Item_Id]
    , DATEDIFF(DAY, '2020', orders.[date]) / 1
ORDER BY [Item_Id] ASC
    , [date] ASC

This is almost what I need. I now want to replace the AVG() function with my own function to calculate the value of the N cheapest items.

Option 1: Scalar-Valued Custom Function

I created a function that take as a parameter a list of orders and the number of orders to average for (N). It orders the items in ascending order of price and use a cursor to select the N cheapest items and returns their average value.

It looks something like this

CREATE FUNCTION avrg_Of_N_Cheapest_Units(@orders [dbo].[Orders] READONLY, @n int)
RETURNS INT
AS
BEGIN
    DECLARE @Qty INT = 0;
    DECLARE @Price INT = 0;
    DECLARE @OrderCount INT = 0;
    DECLARE @OrderValue INT = 0;
    DECLARE cur CURSOR FAST_FORWARD FOR 
        SELECT [Qty], [Price]
        FROM @orders
        ORDER BY [Price] ASC;


    WHILE @OrderCount < @n
    BEGIN
        FETCH NEXT FROM cur
            INTO @Qty, @Price

        IF ((@OrderCount + @Qty) < @n)
        BEGIN
            SET @OrderValue += @Qty * @Price
            SET @OrderCount += @Qty
        END
        ELSE
        BEGIN
            SET @OrderValue += (@n - @OrdersCount) * @Price
            SET @OrderCount += (@n - @OrdersCount)
        END
    END
    RETURN (@OrderValue / @OrderCount)
END;

I am however unable to come up with a query that will call this function on the sub-group made by the GROUP BY clause of a query.

I am also worried about the performance that this function might produces. Cursors aren't recommended generally and I am going to call this function a lot. It should only have to fetch few rows per calls however as I am exiting as soon as it reaches its targeted items count and doesn't go through all the rows every calls.

Option 2: CLR User-Defined Aggregates Functions

I looked into creating a custom aggregate function in C#/.Net. This seems well documented and could be an interesting prospect. These functions however requires to be parallelisable by merging the results together once they all finishes. In my case, I don't see how I could parallelize this process. If the set of cheapest N items is split between 2 processes, there's no way to know during the merging operation.


Solution

  • The easiest way to solve this is to use a tally table and multiple common table expressions. Here is an explanation of each.

    1 - use a values statement to create a derived table using your data.

    2 - tally tables are a list of number.

    3 - explode out items given qty sold for given date, add row num

    4 - select from result 3 by row count. at this point it easy to calculate aggregates since you have the correct data set. If you want top expensive products, change window function to use price descending.

    ;
    -- 1 - user data
    WITH CTE_DATA (ITEM_ID, QTY, PRICE, SOLD_DATE)
    AS
    (
      SELECT * FROM 
      (
      VALUES
        (100, 5, 4, '2000-01-01 00:00:00.000'),
        (100, 5, 2, '2000-01-01 00:00:00.000'),
        (100, 8, 4, '2000-01-01 00:00:00.000'),
        (100, 89, 1, '2000-01-02 00:00:00.000'),
        (100, 44, 5, '2000-01-02 00:00:00.000')
      ) 
      AS D
      (
        ITEM_ID,
        QTY,
        PRICE,
        SOLD_DATE
      ) 
    ),
    
    -- 2 - small tally table
    TALLY(N) AS
    (
        SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM sys.all_columns a --CROSS JOIN sys.all_columns b
    ),
    
    -- 3 - explode by qty, rank by date, item and price
    EXPLODED AS
    (
    SELECT 
      SOLD_DATE, ITEM_ID, PRICE, N,
      ROW_NUMBER() OVER (PARTITION BY SOLD_DATE, ITEM_ID ORDER BY (PRICE)) AS R
    FROM 
      CTE_DATA AS D 
    CROSS APPLY 
      TALLY AS T WHERE D.QTY >= T.N
    )
    
    -- 4 - give me first 8 items per day, item
    SELECT * FROM EXPLODED WHERE R <= 8
    

    I am including screen shots to show the intermediate steps.

    enter image description here

    The first picture shows your data.

    The second picture shows the top 10 elements of the tally table.

    enter image description here

    The third picture shows the exploded data that has a row number R for ordering. We have 5 items at 2 dollars, 5 items at 4 dollars and 8 items at 4 dollars. The N value is the Nth item for random sale X line.

    enter image description here

    The last picture shows the 8 lowest price items sold for a given day and item number. It is interesting that there is no order to the data. Thus the N column might be picking items from the lot way down in the list, but it does not matter since multiple items have the same price.

    Lets quickly calculate the average price.

    -- average price for 8 low cost items per day and per item no
    SELECT SOLD_DATE, ITEM_ID, AVG(CAST(PRICE AS FLOAT)) AS AVG_PRICE 
    FROM EXPLODED WHERE R <= 8
    GROUP BY SOLD_DATE, ITEM_ID
    

    The correct results are shown below.

    enter image description here