I'm developing a collection of foreign data wrappers using multicorn and I've run into an issue with batching data.
So, I have two foreign tables, search and data, that are each backed by a foreign data wrapper that I'm writing.
I need to do a basic join on these tables:
SELECT data.*
FROM search, data
WHERE search.data_id = data.id
AND search.term = 'search for this pls'
This works, but there's a hitch: the data fdw isn't able to batch queries to the server. If the search table returns 5 ids for a given search, then the data fdw is executed once for each of those ids. The API backing the data fdw is capable of processing many ids in one request.
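In other words, with the join as written, the data fdw effectively sees a series of single-id lookups, one per matching row from search (the id values here are illustrative):

SELECT data.*
FROM data
WHERE id = '2244';
-- ...repeated separately for each of the 5 ids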
The following works:
SELECT data.*
FROM data
WHERE id in ('2244', '31895')
In this case the data fdw receives an array of both ids and is able to perform one request.
Is there any way to make the join work so that the data fdw has the opportunity to batch the ids into a single request?
Thanks!
You should look at the EXPLAIN output for your query; you'll probably see that PostgreSQL is performing a nested loop join, i.e. it scans search for the matching rows and, for each result row, scans data for matching rows.
PostgreSQL has other join strategies, like hash joins, but for those it would have to read the whole data table, which is probably not a win.
You might want to try setting enable_nestloop to off and testing query performance. If that is an improvement, you might want to adjust the cost values for the foreign table scan on data to reflect the high “startup costs”, so that the planner becomes more reluctant to choose a nested loop join.
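For the experiment, that could look like this (using the query from the question; the setting change only affects the current session):

SET enable_nestloop = off;

EXPLAIN (ANALYZE)
SELECT data.*
FROM search, data
WHERE search.data_id = data.id
AND search.term = 'search for this pls';

RESET enable_nestloop;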
There is no join strategy like the one you propose – while it may well be a win for FDW joins, it offers no advantage for regular joins. So if the join strategy you envision really is the optimal one, you'd have to first fetch the data_ids from search, construct a query for data, and implement the join in the application.
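Concretely, the application-side join would be two round trips, sketched here with the ids from the question as an example:

-- step 1: fetch the matching ids from search
SELECT data_id
FROM search
WHERE term = 'search for this pls';

-- step 2: in the application, collect those ids and
-- build one batched query against data
SELECT data.*
FROM data
WHERE id IN ('2244', '31895');

That way the data fdw sees the whole id list in a single IN qual, which you've already confirmed it can turn into one API request.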