Search code examples
sqlquery-optimizationpostgispostgresql-9.3

optimizing postgis indexed query. NOTICE: gserialized_gist_joinsel: jointype 1 not supported


I'm running a query to determine the number of points in a polygon, for each polygon in a table. the point layer (twitter) has 500m rows and the polygon layer (us_urbanareas) has ~3000 rows. The point layer's geometry is indexed.

the query looks like:

EXPLAIN SELECT us_urbanareas.name, us_urbanareas.gid, count(twitter.coords) AS total
FROM us_urbanareas LEFT JOIN twitter
ON st_contains(ST_Transform(us_urbanareas.geom,4326),twitter.coords)
GROUP BY us_urbanareas.gid LIMIT 500

and the explain looks like:

"Limit  (cost=1265.59..47875481.71 rows=500 width=45)"
"  ->  GroupAggregate  (cost=1265.59..342780653.01 rows=3580 width=45)"
"        ->  Nested Loop Left Join  (cost=1265.59..340247956.29 rows=506532183 width=45)"
"              ->  Index Scan using us_urbanareas_pkey on us_urbanareas  (cost=0.28..1000.18 rows=3580 width=4028)"
"              ->  Bitmap Heap Scan on twitter  (cost=1265.31..94899.56 rows=14149 width=32)"
"                    Recheck Cond: (st_transform(us_urbanareas.geom, 4326) && coords)"
"                    Filter: _st_contains(st_transform(us_urbanareas.geom, 4326), coords)"
"                    ->  Bitmap Index Scan on coord_gist  (cost=0.00..1261.77 rows=42447 width=0)"
"                          Index Cond: (st_transform(us_urbanareas.geom, 4326) && coords)"

The query works, but is fairly slow. the error message i receive is:

NOTICE:  gserialized_gist_joinsel: jointype 1 not supported

The db server that I'm working on has loads of memory (96gb), and we've just been through a process of optimizing its configuration to make the most of it. I'm running PostGres 9.3.2

Any thoughts on how to optimize the query and get around the jointype not supported?

Update:

updated the SRID's of the us_urbanareas table to remove the transform 2. removed the left join and re-wrote query as per suggestion below 3. tried turning bitmapscan = off

turning off the bitmapscan drastically reduces the cost in the explain, but increases the query time (tested using a few random polys) twofold.

"Limit  (cost=0.83..50601686.53 rows=500 width=45)"
"  ->  GroupAggregate  (cost=0.83..362308070.41 rows=3580 width=45)"
"        ->  Nested Loop  (cost=0.83..359737718.66 rows=514063189 width=45)"
"              ->  Index Scan using us_urbanareas_pkey on us_urbanareas us  (cost=0.28..1002.76 rows=3580 width=45)"
"              ->  Index Scan using coord_gist on twitter tw  (cost=0.55..100341.53 rows=14359 width=32)"
"                    Index Cond: (us.geom && coords)"
"                    Filter: _st_contains(us.geom, coords)"

update 4. executing

select * from pg_stat_activity;

results with,

"INSERT INTO twitter(tweet,name,handle,location,coords,time) VALUES ($$T :($$,$$anissa aguilar$$,$$babyniss_$$,$$Abq/Rio, NM $$,ST_GeomFromText('POINT(-106.659914 35.23192)', (...)"
"SELECT version();"
"SELECT t.oid, t.xmin, t.*, relname, CASE WHEN relkind = 'r' THEN TRUE ELSE FALSE END AS parentistable,   nspname, des.description, l.lanname, p.prosrc, 
  substring(pg_get_triggerdef(t.oid), 'WHEN (.*) EXECUTE PROCEDURE') AS whenclause
  FROM pg_trigger t
 (...)"
"select * from pg_stat_activity;"
" SELECT us.name, us.gid, count(tw.coords) AS total
   FROM us_urbanareas us,  twitter tw
WHERE st_contains(us.geom, tw.coords)
   GROUP BY us.gid;"

Solution

  • Try using the Cartesian product syntax, ie, full join, which in Postgres can be simplified to just a comma. I know this sounds counterintuitive, but the spatial index on the larger table will take care of which polygons are actually included via ST_Contains. I personally find the notion of a LEFT JOIN when you are dealing with spatial intersections to be unnatural, and this becomes more obvious when you have multiple spatial joins, as there is no natural correlate with left or right joins, when you are dealing with intersections/containment in a two-dimensional space.

    EXPLAIN SELECT us.name, us.gid, count(tw.coords) AS total
       FROM us_urbanareas us,  twitter tw
    WHERE st_contains(ST_Transform(us.geom,4326),tw.coords)
       GROUP BY us.gid LIMIT 500;
    

    Another thing you can do that might speed things up a little is to convert the us_urbanareas to 4326 before doing the query. It might not make a difference on such a small table, but in general, if you can avoid transforming fields that are part of the join it will help the optimizer and be one less operation to perform -- it can make a very big difference on joins involving tables with many rows on both sides. You can do this by using the UpdateGeometrySrid function, which will also update the metadata view, geometry_columns.

    For example, in your case:

    Select UpdateGeometrySrid('us_urbanareas', 'geom', 4326);
    

    After which, your query will simplify to:

    SELECT us.name, us.gid, count(tw.coords) AS total
       FROM us_urbanareas us,  twitter tw
    WHERE st_contains(us.geom, tw.coords)
       GROUP BY us.gid LIMIT 500;
    

    I assume you have seen this, but there is an exchange between some of the Postgis devs about this exact message and how the left join syntax can "confuse" the Postgres query optimizer, so that even though the query runs, it might have a suboptimal plan, exactly as you have stated in your question. See, http://trac.osgeo.org/postgis/ticket/2342

    EDIT: While I suspect that removing the ST_Transform and rewriting the join condition will both get rid of the error and speed up the query, there are other things you can do to force the optimizer to choose certain plans. It is a bit of a black art, and generally not recommended, but spatial queries are a bit different in that the complexity of the geometrical operations often greatly outweighs the cost of fetching from disk -- which is, in essence, what explain metrics units are measured in. You can try turning on or off various join types, such as seq_scan or hash_join, see query config variables, eg,

    SET enable_seqscan=off;
    

    But, I repeat, this is something of a black art and generally the optimizer "knows" best, but I do have experience of cutting query times by orders of magnitude by turning off various join types.

    EDIT 2: having looked again at your query, I am puzzled by the need for the bitmap heapscan in there. I just ran a similar query, which finds the area of intersection of a grid of tiles, containing density counts, and some polygons, and groups by the polygon ID, so very analogous. My query looks like:

    SELECT sum(ceil(st_area(st_intersection(ww.geom, dens.geom))/1000000 * osmm_add)), ww.gid 
    FROM density_2010 dens, Wales_west_al2 ww 
    WHERE st_intersects(dens.geom, ww.geom) GROUP BY ww.gid;
    

    and my explain,

    GroupAggregate  (cost=0.00..122642.68 rows=52429 width=552)
      ->  Nested Loop  (cost=0.00..108093.63 rows=52429 width=552)
           Join Filter: _st_intersects(dens.geom, ww.geom)
             ->  Index Scan using wales_west_al2_pkey on wales_west_al2 ww  (cost=0.00..6541.84 rows=124173 width=36)
             ->  Index Scan using ix_spatial_os_density on density_2010 dens  (cost=0.00..0.56 rows=1 width=516)
                   Index Cond: (geom && ww.geom)
    

    This is similar to yours, except for the absence of bitmap heap scans. Could you try setting:

    set enable_bitmapscan=off;
    

    and see if that makes any difference.