(warning: apologetics and hackery incoming...)
Background:
I have a legacy application for which I'd like to avoid rewriting a lot of its SQL code. I'm trying to speed up specific types of very costly queries that it runs (i.e. the low-hanging fruit).
It has a financial transaction ledger represented by a transactions table. When a new row is inserted, a trigger function (not shown here) carries forward the new balance for a given entity. Some types of transaction model externalities (like in-flight payments) by tagging new transactions with a 'related' transaction so that the application can group related transactions together.
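For illustration, here is roughly how a 'related' chain gets built; all IDs, type/status codes and amounts below are hypothetical, not from the real data:

```sql
-- Hypothetical: an in-flight payment is inserted first (suppose it gets
-- transactionid 100 from tid_seq), with "related" left NULL...
INSERT INTO transactions (entityid, type, status, related, amount,
                          abs_amount, is_credit, description, balance)
VALUES (204425, 1, 0, NULL, -5000, 5000, false, 'payment in flight', 95000);

-- ...and a later settlement row points back at it via "related",
-- so the application can group the chain with a self-join.
INSERT INTO transactions (entityid, type, status, related, amount,
                          abs_amount, is_credit, description, balance)
VALUES (204425, 1, 1, 100, -5000, 5000, false, 'payment settled', 95000);
```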
\d transactions
          Table "public.transactions"
    Column     |   Type    |              Modifiers
---------------+-----------+--------------------------------------
 entityid      | bigint    | not null
 transactionid | bigint    | not null default nextval('tid_seq')
 type          | smallint  | not null
 status        | smallint  | not null
 related       | bigint    |
 amount        | bigint    | not null
 abs_amount    | bigint    | not null
 is_credit     | boolean   | not null
 inserted      | timestamp | not null default now()
 description   | text      | not null
 balance       | bigint    | not null
Indexes:
    "transactions_pkey" PRIMARY KEY, btree (transactionid)
    "transactions by entityid" btree (entityid)
    "transactions by initial trans" btree ((COALESCE(related, transactionid)))
Foreign-key constraints:
    "invalid related transaction!" FOREIGN KEY (related)
        REFERENCES transactions(transactionid)
In my test data set, about 1/3 of all transaction rows are updates 'related' to some earlier transaction. The production data is about 25 times bigger transactionid-wise and about 8 times bigger distinct-entityid-wise, and the 1/3 ratio carries through for transaction updates.
The code queries a particularly inefficient VIEW, which was defined as:
CREATE VIEW collapsed_transactions AS
SELECT t.entityid,
g.initial,
g.latest,
i.inserted AS created,
t.inserted AS updated,
t.type,
t.status,
t.amount,
t.abs_amount,
t.is_credit,
t.balance,
t.description
FROM ( SELECT
COALESCE(x.related, x.transactionid) AS initial,
max(x.transactionid) AS latest
FROM transactions x
GROUP BY COALESCE(x.related, x.transactionid)
) g
INNER JOIN transactions t ON t.transactionid = g.latest
INNER JOIN transactions i ON i.transactionid = g.initial;
and typical queries take the form:
SELECT * FROM collapsed_transactions WHERE entityid = 204425;
As you can see, the WHERE entityid = 204425
clause will not be used to constrain the GROUP BY
sub-query, so every entity's transactions get grouped, resulting in a roughly 55,000-times-bigger sub-query result set and a stupidly longer query time... all to arrive at an average of 40 rows (71 in this example) at the time of writing.
I can't normalise the transactions table further (by, say, splitting it into initial_transactions and updated_transactions tables joined by related) without rewriting hundreds of the codebase's SQL queries, many of which use the self-join semantics in different ways.
Insight:
I initially tried rewriting the queries using window functions but ran into all sorts of trouble with that (another SO question for another time). Then I saw that www_fdw passes its WHERE clause to HTTP as GET/POST parameters, and I was very intrigued by the possibility of very naive queries being optimised without much restructuring.
The Postgresql 9.3 manual says:
F.31.4. Remote Query Optimization
postgres_fdw attempts to optimize remote queries to reduce the amount of data transferred from foreign servers. This is done by sending query WHERE clauses to the remote server for execution, and by not retrieving table columns that are not needed for the current query. To reduce the risk of misexecution of queries, WHERE clauses are not sent to the remote server unless they use only built-in data types, operators, and functions. Operators and functions in the clauses must be IMMUTABLE as well.
The query that is actually sent to the remote server for execution can be examined using EXPLAIN VERBOSE.
Attempt:
So I thought that perhaps I could put the GROUP BY into a view, treat that view as a foreign table, and the optimiser would pass the WHERE clause through to that foreign table, resulting in a much more efficient query...
CREATE VIEW foreign_transactions_grouped_by_initial_transaction AS
SELECT
entityid,
COALESCE(t.related, t.transactionid) AS initial,
MAX(t.transactionid) AS latest
FROM transactions t
GROUP BY
t.entityid,
COALESCE(t.related, t.transactionid);
CREATE FOREIGN TABLE transactions_grouped_by_initial_transaction
(entityid bigint, initial bigint, latest bigint)
SERVER local_pg_server
OPTIONS (table_name 'foreign_transactions_grouped_by_initial_transaction');
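(For completeness, a loop-back server like the local_pg_server referenced above can be created roughly as follows; the connection details and credentials here are placeholders, not the ones actually used:)

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Loop-back server pointing at the same database (placeholder values).
CREATE SERVER local_pg_server
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'localhost', port '5432', dbname 'mydb');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER local_pg_server
  OPTIONS (user 'myuser', password 'mypassword');
```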
EXPLAIN ANALYSE VERBOSE
SELECT
t.entityid,
g.initial,
g.latest,
i.inserted AS created,
t.inserted AS updated,
t.type,
t.status,
t.amount,
t.abs_amount,
t.is_credit,
t.balance,
t.description
FROM transactions_grouped_by_initial_transaction g
INNER JOIN transactions t on t.transactionid = g.latest
INNER JOIN transactions i on i.transactionid = g.initial
WHERE g.entityid = 204425;
and that works very nicely!
Nested Loop (cost=100.87..305.05 rows=10 width=116)
(actual time=4.113..16.646 rows=71 loops=1)
Output: t.entityid, g.initial, g.latest, i.inserted,
t.inserted, t.type, t.status, t.amount, t.abs_amount,
t.balance, t.description
-> Nested Loop (cost=100.43..220.42 rows=10 width=108)
(actual time=4.017..10.725 rows=71 loops=1)
Output: g.initial, g.latest, t.entityid, t.inserted,
t.type, t.status, t.amount, t.abs_amount, t.is_credit,
t.balance, t.description
-> Foreign Scan on public.transactions_grouped_by_initial_transaction g
(cost=100.00..135.80 rows=10 width=16)
(actual time=3.914..4.694 rows=71 loops=1)
Output: g.entityid, g.initial, g.latest
Remote SQL:
SELECT initial, latest
FROM public.foreign_transactions_grouped_by_initial_transaction
WHERE ((entityid = 204425))
-> Index Scan using transactions_pkey on public.transactions t
(cost=0.43..8.45 rows=1 width=100)
(actual time=0.023..0.035 rows=1 loops=71)
Output: t.entityid, t.transactionid, t.type, t.status,
t.related, t.amount, t.abs_amount, t.is_credit,
t.inserted, t.description, t.balance
Index Cond: (t.transactionid = g.latest)
-> Index Scan using transactions_pkey on public.transactions i
(cost=0.43..8.45 rows=1 width=16)
(actual time=0.021..0.033 rows=1 loops=71)
Output: i.entityid, i.transactionid, i.type, i.status,
i.related, i.amount, i.abs_amount, i.is_credit,
i.inserted, i.description, i.balance
Index Cond: (i.transactionid = g.initial)
Total runtime: 20.363 ms
Problem:
However, when I try to bake that into a VIEW (with or without another layer of postgres_fdw), the query optimiser doesn't seem to pass through the WHERE clause :-(
CREATE view collapsed_transactions_fast AS
SELECT
t.entityid,
g.initial,
g.latest,
i.inserted AS created,
t.inserted AS updated,
t.type,
t.status,
t.amount,
t.abs_amount,
t.is_credit,
t.balance,
t.description
FROM transactions_grouped_by_initial_transaction g
INNER JOIN transactions t on t.transactionid = g.latest
INNER JOIN transactions i on i.transactionid = g.initial;
EXPLAIN ANALYSE VERBOSE
SELECT * FROM collapsed_transactions_fast WHERE entityid = 204425;
results in:
Nested Loop (cost=534.97..621.88 rows=1 width=117)
(actual time=104720.383..139307.940 rows=71 loops=1)
Output: t.entityid, g.initial, g.latest, i.inserted, t.inserted, t.type,
t.status, t.amount, t.abs_amount, t.is_credit, t.balance,
t.description
-> Hash Join (cost=534.53..613.66 rows=1 width=109)
(actual time=104720.308..139305.522 rows=71 loops=1)
Output: g.initial, g.latest, t.entityid, t.inserted, t.type,
t.status, t.amount, t.abs_amount, t.is_credit, t.balance,
t.description
Hash Cond: (g.latest = t.transactionid)
-> Foreign Scan on public.transactions_grouped_by_initial_transaction g
(cost=100.00..171.44 rows=2048 width=16)
(actual time=23288.569..108916.051 rows=3705600 loops=1)
Output: g.entityid, g.initial, g.latest
Remote SQL:
SELECT initial, latest
FROM public.foreign_transactions_grouped_by_initial_transaction
-> Hash (cost=432.76..432.76 rows=142 width=101)
(actual time=2.103..2.103 rows=106 loops=1)
Output:
t.entityid, t.inserted, t.type, t.status, t.amount,
t.abs_amount, t.is_credit, t.balance, t.description,
t.transactionid
Buckets: 1024 Batches: 1 Memory Usage: 14kB
-> Index Scan using "transactions by entityid"
on public.transactions t
(cost=0.43..432.76 rows=142 width=101)
(actual time=0.049..1.241 rows=106 loops=1)
Output: t.entityid, t.inserted, t.type, t.status,
t.amount, t.abs_amount, t.is_credit,
t.balance, t.description, t.transactionid
Index Cond: (t.entityid = 204425)
-> Index Scan using transactions_pkey on public.transactions i
(cost=0.43..8.20 rows=1 width=16)
(actual time=0.013..0.018 rows=1 loops=71)
Output: i.entityid, i.transactionid, i.type, i.status, i.related,
i.amount, i.abs_amount, i.is_credit, i.inserted, i.description,
i.balance
Index Cond: (i.transactionid = g.initial)
Total runtime: 139575.140 ms
If I can bake that behaviour into a VIEW or an FDW then I can just replace the name of the VIEW in a very small number of queries to make it much more efficient. I don't care if it's super slow for some other use case (more complicated WHERE clause) and I'll name the VIEW to reflect its intended use.
The use_remote_estimate option has its default value of FALSE, but it makes no difference either way.
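(For reference, the option can be toggled at either the server or the foreign-table level, e.g.:)

```sql
-- Per server:
ALTER SERVER local_pg_server
  OPTIONS (ADD use_remote_estimate 'true');

-- Or per foreign table:
ALTER FOREIGN TABLE transactions_grouped_by_initial_transaction
  OPTIONS (ADD use_remote_estimate 'true');
```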
Question:
Is there some trickery I can use to make this admitted hack work?
If I've understood your question correctly, the answer is "no". There's no trick to get extra WHERE clauses passed through the FDW.
However, I think perhaps you're optimising the wrong thing.
I'd replace the whole collapsed_transactions view. Unless I'm missing something, it only depends on the transactions table. Create a real table, keep it updated with triggers, and grant only SELECT permission to the normal user. Get yourself some testing tools from pgTAP if you haven't already and you're good to go.
EDIT: optimisation for the view.
If all you want to do is optimise that one query against the view, and can tweak the definition of the view, try this:
CREATE VIEW collapsed_transactions AS
SELECT
g.entityid, -- THIS HERE
g.initial,
g.latest,
i.inserted AS created,
t.inserted AS updated,
t.type,
t.status,
t.amount,
t.abs_amount,
t.is_credit,
t.balance,
t.description
FROM (
SELECT
entityid, -- THIS HERE
COALESCE(x.related, x.transactionid) AS initial,
max(x.transactionid) AS latest
FROM transactions x
GROUP BY entityid, COALESCE(x.related, x.transactionid)
) g
INNER JOIN transactions t ON t.transactionid = g.latest
INNER JOIN transactions i ON i.transactionid = g.initial;
Note that the subquery exposes entityid and allows us to filter on it. I'm assuming that entityid is constant for the main and related items; otherwise I don't see how the query could work. That should give the planner enough of a grip on the problem to use an index on entityid first off and get the query down to millisecond timings.