Search code examples
arrayspostgresqlsql-order-bypostgresql-12array-agg

How to respect the order of an array in a PostgreSQL select sentence


This is my (extremely simplified) product table and some test data.

drop table if exists product cascade;

create table product (
  product_id  integer not null,
  reference   varchar,
  price       decimal(13,4),
  
  primary key (product_id)
);

insert into product (product_id, reference, price) values 
(1001, 'MX-232',    100.00),
(1011, 'AX-232',     20.00),
(1003, 'KKK 11',     11.00),
(1004, 'OXS SUPER',   0.35),
(1005, 'ROR-MOT',   200.00),
(1006, '234PPP',     30.50),
(1007, 'T555-NS',   110.25),
(1008, 'LM234-XS',  101.20),
(1009, 'MOTOR-22',   12.50),
(1010, 'MOTOR-11',   30.00),
(1002, 'XUL-XUL1',   40.00);

I real life, listing product columns is a taught task, full of joins, case-when-end clauses, etc. On the other hand, there is a large number of queries to be fulfilled, as products by brand, featured products, products by title, by tags, by range or price, etc.

I don't want to repeat and maintain the complex product column listings every time I perform a query so, my current approach is breaking query processes in two tasks:

  • encapsulate the query in functions of type select_products_by_xxx(), that return product_id arrays, properly selected and ordered.
  • encapsulate all the product column complexity in a unique function list_products() that takes a product_id array as a parameter.
  • execute select * from list_products(select_products_by_xxx()) to obtain the desired result for every xxx function.

For example, to select product_id in reverse order (in case this was any meaningful select for the application), a function like this would do the case.

create or replace function select_products_by_inverse () 
returns int[]
as $$
  select 
    array_agg(product_id order by product_id desc)  
  from 
    product;
$$ language sql;

It can be tested to work as

select * from select_products_by_inverse();

select_products_by_inverse                              |
--------------------------------------------------------|
{1011,1010,1009,1008,1007,1006,1005,1004,1003,1002,1001}|

To encapsulate the "listing" part of the query I use this function (again, extremely simplified and without any join or case for the benefit of the example).

create or replace function list_products (
    tid int[]
) 
returns table (
  id        integer,
  reference varchar,
  price     decimal(13,4)
)
as $$
  select
    product_id,
    reference,
    price
  from
    product
  where
    product_id = any (tid);
$$ language sql;

It works, but does not respect the order of products in the passed array.

select * from list_products(select_products_by_inverse());

id  |reference|price   |
----|---------|--------|
1001|MX-232   |100.0000|
1011|AX-232   | 20.0000|
1003|KKK 11   | 11.0000|
1004|OXS SUPER|  0.3500|
1005|ROR-MOT  |200.0000|
1006|234PPP   | 30.5000|
1007|T555-NS  |110.2500|
1008|LM234-XS |101.2000|
1009|MOTOR-22 | 12.5000|
1010|MOTOR-11 | 30.0000|
1002|XUL-XUL1 | 40.0000|

So, the problem is I am passing a custom ordered array of product_id but the list_products() function does not respect the order inside the array.

Obviously, I could include an order by clause in list_products(), but remember that the ordering must be determined by the select_products_by_xxx() functions to keep the list_products() unique.

Any idea?


EDIT

@adamkg solution is simple and works: adding a universal order by clause like this:

order by array_position(tid, product_id);

However, this means to ordering products twice: first inside select_products_by_xxx() and then inside list_products().

An explain exploration renders the following result:

QUERY PLAN                                                            |
----------------------------------------------------------------------|
Sort  (cost=290.64..290.67 rows=10 width=56)                          |
  Sort Key: (array_position(select_products_by_inverse(), product_id))|
  ->  Seq Scan on product  (cost=0.00..290.48 rows=10 width=56)       |
        Filter: (product_id = ANY (select_products_by_inverse()))     |

Now I am wondering if there is any other better approach to reduce cost, keeping separability between functions.

I see two promising strategies:

  • As for the explain clause and the issue itself, it seems that an complete scan of table product is being done inside list_products(). As there may be thousands of products, a better approach would be to scan the passed array instead.
  • The xxx functions can be refactored to return setof int instead of int[]. However, a set cannot be passed as a function parameter.

Solution

  • For long arrays you typically get (much!) more efficient query plans with unnesting the array and joining to the main table. In simple cases, this even preserves the original order of the array without adding ORDER BY. Rows are processed in order. But there are no guarantees and the order may be broken with more joins or with parallel execution etc. To be sure, add WITH ORDINALITY:

    CREATE OR REPLACE FUNCTION list_products (tid int[])  -- VARIADIC?
      RETURNS TABLE (
       id        integer,
       reference varchar,
       price     decimal(13,4)
       )
      LANGUAGE sql STABLE AS
    $func$
      SELECT product_id, p.reference, p.price
      FROM   unnest(tid) WITH ORDINALITY AS t(product_id, ord)
      JOIN   product p USING (product_id)  -- LEFT JOIN ?
      ORDER  BY t.ord
    $func$;
    

    Fast, simple, safe. See:

    You might want to throw in the modifier VARIADIC, so you can call the function with an array or a list of IDs (max 100 items by default). See:

    I would declare STABLE function volatility.

    You might use LEFT JOIN instead of JOIN to make sure that all given IDs are returned - with NULL values if a row with given ID has gone missing.

    db<>fiddle here

    Note a subtle logic difference with duplicates in the array. While product_id is UNIQUE ...

    • unnest + left join returns exactly one row for every given ID - preserving duplicates in the given IDs if any.
    • product_id = any (tid) folds duplicates. (One of the reasons it typically results in more expensive query plans.)

    If there are no dupes in the given array, there is no difference. If there can be duplicates and you want to fold them, your task is ambiguous, as it's undefined which position to keep.