Search code examples
sqlpostgresqlaggregate-functionsplpgsqlcommon-table-expression

Postgres: limit by the results of a sum function


CREATE TABLE inventory_box (
 box_id varchar(10),
 value   integer
);

INSERT INTO inventory_box VALUES ('1', 10), ('2', 15), ('3', 20);

I prepared a sql fiddle with the schema.

I would like to select a list of inventory boxes with combined value of above 20

possible result 1. box 1 + box 2 (10 + 15 >= 20) 

Here is what I am doing right now:

SELECT * FROM inventory_box LIMIT 1 OFFSET 0;
-- count on the client side and see if I got enough
-- got 10
SELECT * FROM inventory_box LIMIT 1 OFFSET 1;
-- count on the client side and see if I got enough
-- got 15, add it to the first query which returned 10
-- total is 25, ok, got enough, return answer

I am looking for a solution where the scan will stop as soon as it reaches the target value


Solution

  • Your question update clarifies that your actual requirements are much simpler than a full-blown "subset sum problem" as suspected by Ulrich:
    Just fetch rows until the sum is big enough.

    I sort by box_id to get deterministic results. You might even drop the ORDER BY altogether to get any valid result a bit faster, yet.

    Slow: Recursive CTE

    WITH RECURSIVE i AS (
       SELECT *, row_number() OVER (ORDER BY box_id) AS rn
       FROM   inventory_box
       )
    , r AS (
       SELECT box_id, val, val AS total, 2 AS rn
       FROM   i
       WHERE  rn = 1
    
       UNION ALL
       SELECT i.box_id, i.val, r.total + i.val, r.rn + 1
       FROM   r
       JOIN   i USING (rn)
       WHERE  r.total < 20
       )
    SELECT box_id, val, total
    FROM   r
    ORDER  BY box_id;
    

    Fast: PL/pgSQL function with FOR loop

    Using sum() as window aggregate function (cheapest).

    CREATE OR REPLACE FUNCTION f_shop_for(_total int)
      RETURNS TABLE (box_id text, val int, total int)
      LANGUAGE plpgsql STABLE AS
    $func$
    BEGIN
       total := 0;
    
       FOR box_id, val, total IN
          SELECT i.box_id, i.val
               , sum(i.val) OVER (ORDER BY i.box_id) AS total
          FROM   inventory_box i
       LOOP
          RETURN NEXT;
          EXIT WHEN total >= _total;
       END LOOP; 
    END
    $func$;
    

    Call:

    SELECT * FROM f_shop_for(20);
    

    I tested both with a big table of 1 million rows. The function only reads the necessary rows from index and table. The CTE is very slow, seems to scan the whole table ...

    fiddle
    OLD sqlfiddle

    Aside: sorting by a varchar column (box_id) with numeric data yields dubious results. Should be a numeric type, really?