I'm trying to find all posts that were within a certain distance, but unfortunately for some inputs the query is extremely slow. Here's some examples:
-- fast (1 millisecond)
SELECT 1
FROM post po
WHERE ST_DWithin(po.geog, ST_SetSRID(ST_MakePoint(-47, -70), 4326)::geography, 4500 * 1609.34)
LIMIT 10;
-- slow (2 seconds)
SELECT 1
FROM post po
WHERE (po.geog <-> ST_SetSRID(ST_MakePoint(-47, -70), 4326)::geography) < 4500 * 1609.34
LIMIT 10;
-- slow (9 seconds)
SELECT 1
FROM post po
WHERE ST_DWithin(po.geog, ST_SetSRID(ST_MakePoint(-70, 40), 4326)::geography, 4500 * 1609.34)
ORDER BY po.reply_count DESC, convo_id DESC
LIMIT 10;
-- fast (1 millisecond)
SELECT 1
FROM post po
WHERE (po.geog <-> ST_SetSRID(ST_MakePoint(-70, 40), 4326)::geography) < 4500 * 1609.34
ORDER BY po.reply_count DESC, convo_id DESC
LIMIT 10;
Here is the visualization of the EXPLAIN ANALYZE
for the third query that is taking 9 seconds: https://explain.depesz.com/s/Xd6d
Here is the EXPLAIN ANALYZE
for the fourth query: https://explain.depesz.com/s/zcKa
Basically, depending on the inputs, it seems like the non-indexed distance using the <->
is sometimes faster, and then for other inputs, the indexed distance operator (ST_DWithin
) is faster.
I think that ST_DWithin
should basically always be faster (or at least complete in a reasonable amount of time), but for some reason in this case it is having an incredible runtime. Does anyone know why the query planner is so off? Based on the explain output, it looks like Postgres thinks there's going to be 100 rows, but there are actually 1,000,000 rows.
Here are the relevant indexes that I have:
CREATE UNIQUE INDEX post_pk ON public.post USING btree (convo_id)
CREATE INDEX post_geog_spidx ON public.post USING spgist (geog)
CREATE INDEX post_reply_count_convo_id_idx ON public.post USING btree (reply_count, convo_id)
CREATE INDEX post_reply_count_idx ON public.post USING btree (reply_count)
Using a gist
instead of a spgist
for geog
did not affect the runtime.
All my geographies are points and I have run VACUUM (ANALYSE, VERBOSE);
already
My version numbers are:
PostgreSQL 12.0, compiled by Visual C++ build 1914, 64-bit
and
POSTGIS="3.0.0 r17983" [EXTENSION] PGSQL="120" GEOS="3.8.0-CAPI-1.13.1 " PROJ="Rel. 5.2.0, September 15th, 2018" LIBXML="2.9.9" LIBJSON="0.12" LIBPROTOBUF="1.2.1" WAGYU="0.4.3 (Internal)" TOPOLOGY
Estimating the number of rows which will be returned by these types of multidimensional operations is very hard. It would require types of statistics which PostgreSQL doesn't gather.
So it is mostly upon you to figure out how many rows will qualify (before the LIMIT), and craft your query appropriately. Is your ST_DWithin always so unrestrictive? Do you know ahead of time when it is going to be so unrestrictive?
I'm not a GIS expert, but I think the units for SRID 4326 is degrees, so pretty much everything is going to be within (4500 * 1609.34) degrees of everything else, isn't it?