Search code examples
postgresqlindexingquery-optimizationnearest-neighborpostgresql-9.4

LATERAL JOIN not using trigram index


I want to do some basic geocoding of addresses using Postgres. I have an address table that has around 1 million raw address strings:

=> \d addresses
  Table "public.addresses"
 Column  | Type | Modifiers
---------+------+-----------
 address | text |

I also have a table of location data:

=> \d locations
   Table "public.locations"
   Column   | Type | Modifiers
------------+------+-----------
 id         | text |
 country    | text |
 postalcode | text |
 latitude   | text |
 longitude  | text |

Most of the address strings contain postalcodes, so my first attempt was to do a like and a lateral join:

EXPLAIN SELECT * FROM addresses a
JOIN LATERAL (
    SELECT * FROM locations
    WHERE address ilike '%' || postalcode || '%'
    ORDER BY LENGTH(postalcode) DESC
    LIMIT 1
) AS l ON true;

That gave the expected result, but it was slow. Here's the query plan:

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Nested Loop  (cost=18383.07..18540688323.77 rows=1008572 width=91)
   ->  Seq Scan on addresses a  (cost=0.00..20997.72 rows=1008572 width=56)
   ->  Limit  (cost=18383.07..18383.07 rows=1 width=35)
         ->  Sort  (cost=18383.07..18391.93 rows=3547 width=35)
               Sort Key: (length(locations.postalcode))
               ->  Seq Scan on locations  (cost=0.00..18365.33 rows=3547 width=35)
                     Filter: (a.address ~~* (('%'::text || postalcode) || '%'::text))

I tried adding a gist trigram index to the address column, like mentioned at https://stackoverflow.com/a/13452528/36191, but the query plan for the above query doesn't make use of it, and the query plan in unchanged.

CREATE INDEX idx_address ON addresses USING gin (address gin_trgm_ops);

I have to remove the order by and limit in the lateral join query for the index to get used, which doesn't give me the results I want. Here's the query plan for the query without ORDER or LIMIT:

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Nested Loop  (cost=39.35..129156073.06 rows=3577682241 width=86)
   ->  Seq Scan on locations  (cost=0.00..12498.55 rows=709455 width=28)
   ->  Bitmap Heap Scan on addresses a  (cost=39.35..131.60 rows=5043 width=58)
         Recheck Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))
         ->  Bitmap Index Scan on idx_address  (cost=0.00..38.09 rows=5043 width=0)
               Index Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))

Is there something I can do to get the query to use the index, or is there a better way to rewrite this query?


Solution

  • Why?

    The query cannot use the index on principal. You would need an index on the table locations, but the one you have is on the table addresses.

    You can verify my claim by setting:

    SET enable_seqscan = off;
    

    (In your session only, and for debugging only. Never use it in production.) It's not like the index would be more expensive than a sequential scan, there is just no way for Postgres to use it for your query at all.

    Aside: [INNER] JOIN ... ON true is just an awkward way of saying CROSS JOIN ...

    Why is the index used after removing ORDER and LIMIT?

    Because Postgres can rewrite this simple form to:

    SELECT *
    FROM   addresses a
    JOIN   locations l ON a.address ILIKE '%' || l.postalcode || '%';
    

    You'll see the exact same query plan. (At least I do in my tests on Postgres 9.5.)

    Solution

    You need an index on locations.postalcode. And while using LIKE or ILIKE you would also need to bring the indexed expression (postalcode) to the left side of the operator. ILIKE is implemented with the operator ~~* and this operator has no COMMUTATOR (a logical necessity), so it's not possible to flip operands around. Detailed explanation in these related answers:

    A solution is to use the trigram similarity operator % or its inverse, the distance operator <-> in a nearest neighbour query instead (each is commutator for itself, so operands can switch places freely):

    SELECT *
    FROM   addresses a
    JOIN   LATERAL (
       SELECT *
       FROM   locations
       ORDER  BY postalcode <-> a.address
       LIMIT  1
       ) l ON address ILIKE '%' || postalcode || '%';

    Find the most similar postalcode for each address, and then check if that postalcode actually matches fully.

    This way, a longer postalcode will be preferred automatically since it's more similar (smaller distance) than a shorter postalcode that also matches.

    A bit of uncertainty remains. Depending on possible postal codes, there could be false positives due to matching trigrams in other parts of the string. There is not enough information in the question to say more.

    Here, [INNER] JOIN instead of CROSS JOIN makes sense, since we add an actual join condition.

    The manual:

    This can be implemented quite efficiently by GiST indexes, but not by GIN indexes.

    So:

    CREATE INDEX locations_postalcode_trgm_gist_idx ON locations
    USING gist (postalcode gist_trgm_ops);