Search code examples
postgresqllarge-dataolap

Union-all postgresql select clauses preserving order


Having complex SQL query to RDBMS Postgresql which consists of multiple nested UNION ALL-like nested queries, something like this:

(
  (
     (<QUERY 1-1-1> UNION ALL <QUERY 1-1-2>) UNION ALL
     (<QUERY 1-1-3> UNION ALL <QUERY 1-1-4>) UNION ALL
     ...
  ) UNION ALL
  (
     (<QUERY 1-2-1> UNION ALL <QUERY 1-2-2>) UNION ALL
     (<QUERY 1-2-3> UNION ALL <QUERY 1-2-4>) UNION ALL
     ...
  ) UNION ALL
  ...
) UNION ALL
(
  (
     (<QUERY 2-1-1> UNION ALL <QUERY 2-1-2>) UNION ALL
     (<QUERY 2-1-3> UNION ALL <QUERY 2-1-4>) UNION ALL
     ...
  ) UNION ALL
  (
     (<QUERY 2-2-1> UNION ALL <QUERY 2-2-2>) UNION ALL
     (<QUERY 2-2-3> UNION ALL <QUERY 2-2-4>) UNION ALL
     ...
  ) UNION ALL
  ...
) UNION ALL
(
  ...
)

Each <QUERY i-th> is relatively lightweight query which produces about 100K-1M rows and can be sorted in-memory without significant performance impact.

Result query is consists of tens thousands multi-level nested UNION ALL queries in strict conventional order, like traversing tree in depth, so result query is several billion rows dataset.

So question is: since SQL does not guarantee order of UNION ALL statement, outer query should contain ORDER BY clause, but server hardware cannot perform sorting of billon rows in required time.

However, order of united queries is strict determined, and should be: <QUERY 1-1-1>, <QUERY 1-1-2> and so on, sorted hierarchically, so in fact sorting of outer query is redundant, since dataset is already sorted by sql query structure.

It's necessary to force Postgres to preserve order of nested UNION ALL statements. How to do it? Any plugins, extensions and even dirty hacks are welcome.

Please avoid of answers and comments mention XY-like problem - question is formulated as-is in research manner. Structure of database and dataset cannot be changed by conditions of question. Thanks.


Solution

  • Try this - allocate the queries' results into a temporary table. Here it is step by step:

    • Create a temporary table ex. the_temp_table like the the record type of <QUERY 1-1-1>
    create temporary table the_temp_table as <QUERY 1-1-1> limit 0;
    
    • Add an auto-increment primary key column extra_id to the_temp_table
    alter table the_temp_table add column extra_id serial primary key not null;
    
    • Then run all your queries one by one in the right order
    insert into the_temp_table <QUERY 1-1-1>; insert into the_temp_table <QUERY 1-1-2>;
    insert into the_temp_table <QUERY 1-1-3>; insert into the_temp_table <QUERY 1-1-4>;
    
    insert into the_temp_table <QUERY 1-2-1>; insert into the_temp_table <QUERY 1-2-2>;
    insert into the_temp_table <QUERY 1-2-3>; insert into the_temp_table <QUERY 1-2-4>;
    
    -- continue 
    
    • Finally
    select <fields list w/o extra_id> from the_temp_table order by extra_id;
    
    -- no sorting is taking place here
    

    Effectively thus you will be emulating UNION ALL in a controlled manner with an insignificant performance penalty.