Tags: postgresql, group-by, postgresql-9.3, sql-view, foreign-data-wrapper

Use postgres_fdw to speed up view containing multiple self-join on group-by


(warning: apologetics and hackery incoming...)

Background:

I have a legacy application for which I'd like to avoid rewriting a lot of its SQL code. I'm trying to speed up specific types of very costly queries that it runs (i.e. the low-hanging fruit).

It has a financial transaction ledger represented by a transactions table. When a new row is inserted, a trigger function (not shown here) carries forward the new balance for a given entity.

Some transaction types model externalities (like in-flight payments) by tagging new transactions with a 'related' transaction so that the application can group related transactions together.

\d transactions

                  Table "public.transactions"
       Column        |   Type    | Modifiers 
---------------------+-----------+-----------
 entityid            | bigint    | not null
 transactionid       | bigint    | not null default nextval('tid_seq')
 type                | smallint  | not null
 status              | smallint  | not null
 related             | bigint    | 
 amount              | bigint    | not null
 abs_amount          | bigint    | not null
 is_credit           | boolean   | not null
 inserted            | timestamp | not null default now()
 description         | text      | not null
 balance             | bigint    | not null

Indexes:
    "transactions_pkey" PRIMARY KEY, btree (transactionid)
    "transactions by entityid" btree (entityid)
    "transactions by initial trans" btree ((COALESCE(related, transactionid)))

Foreign-key constraints:
    "invalid related transaction!" FOREIGN KEY (related) 
                                   REFERENCES transactions(transactionid)
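
(For completeness, the balance-carrying trigger mentioned above isn't part of the question; a purely hypothetical sketch of what such a function might look like:)

CREATE FUNCTION carry_forward_balance() RETURNS trigger AS $$
BEGIN
    -- Hypothetical sketch only: take the entity's most recent balance
    -- and apply the new amount to it.
    NEW.balance := COALESCE(
        (SELECT balance
           FROM transactions
          WHERE entityid = NEW.entityid
          ORDER BY transactionid DESC
          LIMIT 1),
        0) + NEW.amount;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER transactions_carry_balance
    BEFORE INSERT ON transactions
    FOR EACH ROW EXECUTE PROCEDURE carry_forward_balance();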

In my test data set, I have:

  • roughly 5.5 million rows total
  • roughly 3.7 million rows with no 'related' transaction
  • roughly 1.8 million rows with a 'related' transaction
  • roughly 55k distinct entityids (customers).

So about 1/3 of all transaction rows are updates 'related' to some earlier transaction. The production data is about 25 times bigger transactionid-wise and about 8 times bigger distinct entityid-wise, and the 1/3 ratio carries through for transaction updates.

The code queries a particularly inefficient VIEW, which was defined as:

CREATE VIEW collapsed_transactions AS
SELECT t.entityid,
    g.initial,
    g.latest,
    i.inserted AS created,
    t.inserted AS updated,
    t.type,
    t.status,
    t.amount,
    t.abs_amount,
    t.is_credit,
    t.balance,
    t.description
FROM ( SELECT 
          COALESCE(x.related, x.transactionid) AS initial,
          max(x.transactionid) AS latest
       FROM transactions x
       GROUP BY COALESCE(x.related, x.transactionid)
     ) g
INNER JOIN transactions t ON t.transactionid = g.latest
INNER JOIN transactions i ON i.transactionid = g.initial;

and typical queries take the form:

SELECT * FROM collapsed_transactions WHERE entityid = 204425;

As you can see, the WHERE entityid = 204425 clause will not be used to constrain the GROUP BY sub-query, so every entityid's transactions get grouped, producing a sub-query result set roughly 55,000 times bigger than needed and a stupidly longer query time... all to arrive at an average of 40 rows (71 in this example) at the time of writing.

I can't normalise the transactions table further (by, say, having initial_transactions and updated_transactions tables joined by related) without rewriting hundreds of the codebase's SQL queries, many of which use the self-join semantics in different ways.

Insight:

I initially tried rewriting the queries using WINDOW functions but ran into all sorts of trouble with that (another SO question for another time). When I saw that the www_fdw passes its WHERE clause through to HTTP as GET/POST parameters, I was very intrigued by the possibility of very naive queries being optimised without much restructuring.
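
For reference, the kind of window-function rewrite I mean looks roughly like this sketch (not the exact query; 'created' would still need a join back to the initial transaction or another window function):

-- Sketch only: pick the latest row per COALESCE(related, transactionid)
-- group with a window function instead of the GROUP BY self-join.
SELECT entityid, initial, transactionid AS latest, inserted AS updated,
       type, status, amount, abs_amount, is_credit, balance, description
FROM (
    SELECT t.*,
           COALESCE(t.related, t.transactionid) AS initial,
           row_number() OVER (PARTITION BY COALESCE(t.related, t.transactionid)
                              ORDER BY t.transactionid DESC) AS rn
    FROM transactions t
) s
WHERE rn = 1;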

The PostgreSQL 9.3 manual says:

F.31.4. Remote Query Optimization

postgres_fdw attempts to optimize remote queries to reduce the amount of data transferred from foreign servers. This is done by sending query WHERE clauses to the remote server for execution, and by not retrieving table columns that are not needed for the current query. To reduce the risk of misexecution of queries, WHERE clauses are not sent to the remote server unless they use only built-in data types, operators, and functions. Operators and functions in the clauses must be IMMUTABLE as well.

The query that is actually sent to the remote server for execution can be examined using EXPLAIN VERBOSE.

Attempt:

So I thought that perhaps I could put the GROUP BY into a view, treat that view as a foreign table, and have the optimiser pass the WHERE clause through to that foreign table, resulting in a much more efficient query...
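
(For completeness, local_pg_server below is a loopback postgres_fdw server pointing back at the same database; its original setup isn't shown here, but it would be something along these lines, with the dbname and credentials as placeholders:)

-- Assumed loopback server setup; dbname and credentials are placeholders.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER local_pg_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', port '5432', dbname 'mydb');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER local_pg_server
    OPTIONS (user 'myuser', password 'mypassword');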

CREATE VIEW foreign_transactions_grouped_by_initial_transaction AS 
  SELECT
    entityid,
    COALESCE(t.related, t.transactionid) AS initial,
    MAX(t.transactionid) AS latest
  FROM transactions t
  GROUP BY
    t.entityid,
    COALESCE(t.related, t.transactionid);

CREATE FOREIGN TABLE transactions_grouped_by_initial_transaction 
  (entityid bigint, initial bigint, latest bigint) 
  SERVER local_pg_server 
  OPTIONS (table_name 'foreign_transactions_grouped_by_initial_transaction');

EXPLAIN ANALYSE VERBOSE
  SELECT 
    t.entityid, 
    g.initial, 
    g.latest, 
    i.inserted AS created, 
    t.inserted AS updated, 
    t.type, 
    t.status,
    t.amount,
    t.abs_amount,
    t.is_credit,
    t.balance,
    t.description
  FROM transactions_grouped_by_initial_transaction g 
  INNER JOIN transactions t on t.transactionid = g.latest
  INNER JOIN transactions i on i.transactionid = g.initial 
  WHERE g.entityid = 204425;

and that works very nicely!

 Nested Loop  (cost=100.87..305.05 rows=10 width=116) 
              (actual time=4.113..16.646 rows=71 loops=1)
   Output: t.entityid, g.initial, g.latest, i.inserted, 
           t.inserted, t.type, t.status, t.amount, t.abs_amount, 
           t.balance, t.description
   ->  Nested Loop  (cost=100.43..220.42 rows=10 width=108) 
                    (actual time=4.017..10.725 rows=71 loops=1)
         Output: g.initial, g.latest, t.entityid, t.inserted, 
                 t.type, t.status, t.amount, t.abs_amount, t.is_credit,
                 t.balance, t.description
     ->  Foreign Scan on public.transactions_grouped_by_initial_transaction g
                 (cost=100.00..135.80 rows=10 width=16) 
                 (actual time=3.914..4.694 rows=71 loops=1)
            Output: g.entityid, g.initial, g.latest
            Remote SQL: 
              SELECT initial, latest
              FROM public.foreign_transactions_grouped_by_initial_transaction
              WHERE ((entityid = 204425))
         ->  Index Scan using transactions_pkey on public.transactions t  
                  (cost=0.43..8.45 rows=1 width=100) 
                  (actual time=0.023..0.035 rows=1 loops=71)
               Output: t.entityid, t.transactionid, t.type, t.status, 
                       t.related, t.amount, t.abs_amount, t.is_credit, 
                       t.inserted, t.description, t.balance
               Index Cond: (t.transactionid = g.latest)
   ->  Index Scan using transactions_pkey on public.transactions i  
            (cost=0.43..8.45 rows=1 width=16) 
            (actual time=0.021..0.033 rows=1 loops=71)
         Output: i.entityid, i.transactionid, i.type, i.status, 
                 i.related, i.amount, i.abs_amount, i.is_credit, 
                 i.inserted, i.description, i.balance
         Index Cond: (i.transactionid = g.initial)
 Total runtime: 20.363 ms

Problem:

However, when I try to bake that into a VIEW (with or without another layer of postgres_fdw) the query optimiser doesn't seem to pass through the WHERE clause :-(

CREATE VIEW collapsed_transactions_fast AS
  SELECT 
    t.entityid, 
    g.initial, 
    g.latest, 
    i.inserted AS created, 
    t.inserted AS updated, 
    t.type, 
    t.status,
    t.amount,
    t.abs_amount,
    t.is_credit,
    t.balance,
    t.description
  FROM transactions_grouped_by_initial_transaction g 
  INNER JOIN transactions t on t.transactionid = g.latest
  INNER JOIN transactions i on i.transactionid = g.initial;

EXPLAIN ANALYSE VERBOSE
  SELECT * FROM collapsed_transactions_fast WHERE entityid = 204425; 

results in:

Nested Loop  (cost=534.97..621.88 rows=1 width=117) 
             (actual time=104720.383..139307.940 rows=71 loops=1)
  Output: t.entityid, g.initial, g.latest, i.inserted, t.inserted, t.type, 
          t.status, t.amount, t.abs_amount, t.is_credit, t.balance, 
          t.description
  ->  Hash Join  (cost=534.53..613.66 rows=1 width=109) 
                 (actual time=104720.308..139305.522 rows=71 loops=1)
        Output: g.initial, g.latest, t.entityid, t.inserted, t.type, 
                t.status, t.amount, t.abs_amount, t.is_credit, t.balance, 
                t.description
        Hash Cond: (g.latest = t.transactionid)
    ->  Foreign Scan on public.transactions_grouped_by_initial_transaction g
         (cost=100.00..171.44 rows=2048 width=16) 
         (actual time=23288.569..108916.051 rows=3705600 loops=1)
           Output: g.entityid, g.initial, g.latest
           Remote SQL: 
            SELECT initial, latest 
            FROM public.foreign_transactions_grouped_by_initial_transaction
        ->  Hash  (cost=432.76..432.76 rows=142 width=101) 
                  (actual time=2.103..2.103 rows=106 loops=1)
              Output: 
                t.entityid, t.inserted, t.type, t.status, t.amount, 
                t.abs_amount, t.is_credit, t.balance, t.description, 
                t.transactionid
              Buckets: 1024  Batches: 1  Memory Usage: 14kB
              ->  Index Scan using "transactions by entityid" 
                  on public.transactions t  
                     (cost=0.43..432.76 rows=142 width=101) 
                     (actual time=0.049..1.241 rows=106 loops=1)
                    Output: t.entityid, t.inserted, t.type, t.status, 
                            t.amount, t.abs_amount, t.is_credit, 
                            t.balance, t.description, t.transactionid
                    Index Cond: (t.entityid = 204425)
  ->  Index Scan using transactions_pkey on public.transactions i  
        (cost=0.43..8.20 rows=1 width=16) 
        (actual time=0.013..0.018 rows=1 loops=71)
        Output: i.entityid, i.transactionid, i.type, i.status, i.related, 
                i.amount, i.abs_amount, i.is_credit, i.inserted, i.description, 
                 i.balance
        Index Cond: (i.transactionid = g.initial)
Total runtime: 139575.140 ms

If I can bake that behaviour into a VIEW or an FDW then I can just replace the name of the VIEW in a very small number of queries to make it much more efficient. I don't care if it's super slow for some other use case (more complicated WHERE clause) and I'll name the VIEW to reflect its intended use.

The use_remote_estimate option is at its default value of false, but toggling it makes no difference either way.
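
(For reference, that option would be toggled like this on the server or the foreign table; as noted, it made no difference here:)

ALTER SERVER local_pg_server
    OPTIONS (ADD use_remote_estimate 'true');

-- or per foreign table:
ALTER FOREIGN TABLE transactions_grouped_by_initial_transaction
    OPTIONS (ADD use_remote_estimate 'true');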

Question:

Is there some trickery I can use to make this admitted hack work?


Solution

  • If I've understood your question correctly, the answer is "no". There's no "trick" to get extra WHERE clauses passed through the FDW wrapper.

    However, I think perhaps you're optimising the wrong thing.

    I'd replace the whole collapsed_transactions view. Unless I'm missing something, it only depends on the transactions table. Create a table, keep it updated with triggers, and only grant SELECT permission to the normal user (see the sketch below). Get yourself some testing tools from pgTAP if you haven't already and you're good to go.
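
    A rough sketch of that approach in PostgreSQL 9.3 syntax (the table, function,
    trigger and role names here are all made up for illustration):

    CREATE TABLE collapsed_transactions_tbl (
        entityid  bigint    NOT NULL,
        initial   bigint    NOT NULL PRIMARY KEY,
        latest    bigint    NOT NULL,
        created   timestamp NOT NULL,
        updated   timestamp NOT NULL
        -- ... plus whichever columns of the latest transaction you need
    );

    CREATE FUNCTION maintain_collapsed_transactions() RETURNS trigger AS $$
    BEGIN
        -- A 'related' row moves its group's latest pointer forward;
        -- a fresh transaction starts a new collapsed row.
        UPDATE collapsed_transactions_tbl
           SET latest  = NEW.transactionid,
               updated = NEW.inserted
         WHERE initial = COALESCE(NEW.related, NEW.transactionid);
        IF NOT FOUND THEN
            INSERT INTO collapsed_transactions_tbl
                   (entityid, initial, latest, created, updated)
            VALUES (NEW.entityid, COALESCE(NEW.related, NEW.transactionid),
                    NEW.transactionid, NEW.inserted, NEW.inserted);
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER collapsed_transactions_maintenance
        AFTER INSERT ON transactions
        FOR EACH ROW EXECUTE PROCEDURE maintain_collapsed_transactions();

    GRANT SELECT ON collapsed_transactions_tbl TO app_user;  -- app_user: your normal user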


    EDIT: optimisation for the view.

    If all you want to do is optimise that one query against the view, and can tweak the definition of the view, try this:

    CREATE VIEW collapsed_transactions AS
    SELECT
        g.entityid,  -- THIS HERE
        g.initial,
        g.latest,
        i.inserted AS created,
        t.inserted AS updated,
        t.type,
        t.status,
        t.amount,
        t.abs_amount,
        t.is_credit,
        t.balance,
        t.description
    FROM (
        SELECT
            x.entityid, -- THIS HERE
            COALESCE(x.related, x.transactionid) AS initial,
            max(x.transactionid) AS latest
        FROM transactions x
        GROUP BY x.entityid, COALESCE(x.related, x.transactionid)
    ) g
    INNER JOIN transactions t ON t.transactionid = g.latest
    INNER JOIN transactions i ON i.transactionid = g.initial;
    

    Note that the subquery exposes entityid and allows us to filter on it. I'm assuming that entityid is constant for the main and related items, otherwise I don't see how the query could work. That should give the planner enough grip on the problem to use an index on entityid first off and get the query down to millisecond timings.
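
    If the planner still wants more help, a composite index matching the new
    GROUP BY might be worth trying (untested against this data, so treat it as
    a suggestion rather than a fix):

    CREATE INDEX transactions_by_entity_and_initial
        ON transactions (entityid, (COALESCE(related, transactionid)));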