Search code examples
postgresqlperformanceindexingquery-optimizationpg-trgm

Postgres: How to optimize similarity queries searching columns on multiple tables? (pg_trgm)


I am trying to optimize a similarity (word_similarity) query, which searches multiple columns on 2 tables.

The tables and their indices look like this:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Table definitions
CREATE TABLE usr (
    id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
    loc_id bigint,
    first_name text,
    last_name text
);
CREATE TABLE loc (
    id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
    country text,
    city text
);

-- Primary keys
ALTER TABLE ONLY usr
    ADD CONSTRAINT usr_pkey PRIMARY KEY (id);
ALTER TABLE ONLY loc
    ADD CONSTRAINT loc_pkey PRIMARY KEY (id);

-- Foreign key
ALTER TABLE ONLY usr.loc_id
    ADD CONSTRAINT usr_loc_id_fkey FOREIGN KEY (loc_id) REFERENCES loc(id);

-- Indices
CREATE INDEX usr_loc_id_idx ON usr
    USING btree (loc_id);
    
CREATE INDEX usr_first_name_last_name_idx ON usr
    USING gist ((first_name || ' ' || last_name) gist_trgm_ops);
CREATE INDEX loc_country_city_idx ON loc
    USING gist ((country || ' ' || city) gist_trgm_ops);

-- Insert dummy data
INSERT INTO loc (country, city)
SELECT substr(md5(random()::text), 0, 25), substr(md5(random()::text), 0, 25)
FROM generate_series(1, 100000) AS t;

INSERT INTO usr (first_name, last_name, loc_id)
SELECT substr(md5(random()::text), 0, 25), substr(md5(random()::text), 0, 25), trunc(random() * 99999 + 1)
FROM generate_series(1, 1000000) AS t;

When searching the columns of just one table, everything seems to be optimized fine:

SELECT
    usr.id AS usr_id, usr.first_name, usr.last_name,
    loc.id AS loc_id, loc.country, loc.city
FROM usr
LEFT JOIN loc ON usr.loc_id = loc.id
WHERE 'baker' <% (usr.first_name || ' ' || usr.last_name)
ORDER  BY 'baker' <<-> (usr.first_name || ' ' || usr.last_name)
LIMIT 20;

This produces the below query plan:

Limit  (cost=0.70..236.19 rows=20 width=118) (actual time=397.086..596.842 rows=1 loops=1)
  ->  Nested Loop Left Join  (cost=0.70..1178.16 rows=100 width=118) (actual time=397.084..596.839 rows=1 loops=1)
        ->  Index Scan using usr_first_name_last_name_idx on usr  (cost=0.41..422.41 rows=100 width=66) (actual time=397.038..596.792 rows=1 loops=1)
              Index Cond: (((first_name || ' '::text) || last_name) %> 'baker'::text)
              Order By: (((first_name || ' '::text) || last_name) <->> 'baker'::text)
        ->  Index Scan using loc_pkey on loc  (cost=0.29..7.55 rows=1 width=56) (actual time=0.037..0.037 rows=1 loops=1)
              Index Cond: (id = usr.loc_id)
Planning Time: 1.252 ms
Execution Time: 596.931 ms

My attempt to write a query that searches columns on both the usr and loc tables is:

SELECT
    usr.id AS usr_id, usr.first_name, usr.last_name,
    loc.id AS loc_id, loc.country, loc.city
FROM usr
LEFT JOIN loc ON usr.loc_id = loc.id
WHERE
    'baker' <% (usr.first_name || ' ' || usr.last_name)
    OR
    'baker' <% (loc.country || ' ' || loc.city)
ORDER BY
    ('baker' <<-> (usr.first_name || ' ' || usr.last_name))
    +
    ('baker' <<-> (loc.country || ' ' || loc.city));

And the query plan for this one is:

Gather Merge  (cost=21071.24..21090.61 rows=166 width=118) (actual time=6297.322..6301.774 rows=1 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Sort  (cost=20071.22..20071.42 rows=83 width=118) (actual time=6286.205..6286.208 rows=0 loops=3)
        Sort Key: ((('baker'::text <<-> ((usr.first_name || ' '::text) || usr.last_name)) + ('baker'::text <<-> ((loc.country || ' '::text) || loc.city))))
        Sort Method: quicksort  Memory: 25kB
        Worker 0:  Sort Method: quicksort  Memory: 25kB
        Worker 1:  Sort Method: quicksort  Memory: 25kB
        ->  Parallel Hash Left Join  (cost=2460.57..20068.57 rows=83 width=118) (actual time=4197.337..6284.058 rows=0 loops=3)
              Hash Cond: (usr.loc_id = loc.id)
              Filter: (('baker'::text <% ((usr.first_name || ' '::text) || usr.last_name)) OR ('baker'::text <% ((loc.country || ' '::text) || loc.city)))
              Rows Removed by Filter: 333335
              ->  Parallel Seq Scan on usr  (cost=0.00..16512.69 rows=416669 width=66) (actual time=0.078..214.231 rows=333335 loops=3)
              ->  Parallel Hash  (cost=1725.25..1725.25 rows=58825 width=56) (actual time=23.072..23.073 rows=33334 loops=3)
                    Buckets: 131072  Batches: 1  Memory Usage: 10432kB
                    ->  Parallel Seq Scan on loc  (cost=0.00..1725.25 rows=58825 width=56) (actual time=0.007..5.594 rows=33334 loops=3)
Planning Time: 3.470 ms
Execution Time: 6301.859 ms

Is there a way I can improve this query? Or is such a "multi-table similarity" query not possible in Postgres?

Thank you!


Solution

  • It looks like OR prevented your index from kicking in. You could instead run your select with each of the conditions separately, UNION that, then apply your desired ORDER after: demo

    EXPLAIN ANALYSE VERBOSE
    SELECT * FROM
    (   SELECT
            usr.id AS usr_id, usr.first_name, usr.last_name,
            loc.id AS loc_id, loc.country, loc.city
        FROM usr
        LEFT JOIN loc ON usr.loc_id = loc.id
        WHERE 'baker' <% (loc.country || ' ' || loc.city)
        UNION
        SELECT
            usr.id AS usr_id, usr.first_name, usr.last_name,
            loc.id AS loc_id, loc.country, loc.city
        FROM usr
        LEFT JOIN loc ON usr.loc_id = loc.id
        WHERE 'baker' <% (usr.first_name || ' ' || usr.last_name)
    ) ORDER BY 
        ('baker' <<-> (first_name || ' ' || last_name))
        +
        ('baker' <<-> (country || ' ' || city));
    
    | QUERY PLAN |
    |:-----------|
    | Sort  (cost=5112.22..5115.00 rows=1111 width=148) (actual time=34.165..34.171 rows=0 loops=1) |
    |   Output: \_.usr\_id, \_.first\_name, \_.last\_name, \_.loc\_id, \_.country, \_.city, ((('baker'::text \<\<-> ((\_.first\_name \|\| ' '::text) \|\| \_.last\_name)) + ('baker'::text \<\<-> ((\_.country \|\| ' '::text) \|\| \_.city)))) |
    |   Sort Key: ((('baker'::text \<\<-> ((\_.first\_name \|\| ' '::text) \|\| \_.last\_name)) + ('baker'::text \<\<-> ((\_.country \|\| ' '::text) \|\| \_.city)))) |
    |   Sort Method: quicksort  Memory: 25kB |
    |   ->  Subquery Scan on \_  (cost=5025.47..5056.02 rows=1111 width=148) (actual time=34.159..34.164 rows=0 loops=1) |
    |         Output: \_.usr\_id, \_.first\_name, \_.last\_name, \_.loc\_id, \_.country, \_.city, (('baker'::text \<\<-> ((\_.first\_name \|\| ' '::text) \|\| \_.last\_name)) + ('baker'::text \<\<-> ((\_.country \|\| ' '::text) \|\| \_.city))) |
    |         ->  HashAggregate  (cost=5025.47..5036.58 rows=1111 width=144) (actual time=34.158..34.163 rows=0 loops=1) |
    |               Output: usr.id, usr.first\_name, usr.last\_name, loc.id, loc.country, loc.city |
    |               Group Key: usr.id, usr.first\_name, usr.last\_name, loc.id, loc.country, loc.city |
    |               Batches: 1  Memory Usage: 73kB |
    |               ->  Append  (cost=798.13..5008.80 rows=1111 width=144) (actual time=34.144..34.148 rows=0 loops=1) |
    |                     ->  Hash Join  (cost=798.13..2240.77 rows=555 width=144) (actual time=16.292..16.294 rows=0 loops=1) |
    |                           Output: usr.id, usr.first\_name, usr.last\_name, loc.id, loc.country, loc.city |
    |                           Inner Unique: true |
    |                           Hash Cond: (usr.loc\_id = loc.id) |
    |                           ->  Seq Scan on usr.usr  (cost=0.00..1296.75 rows=55575 width=80) (actual time=0.007..0.007 rows=1 loops=1) |
    |                                 Output: usr.id, usr.loc\_id, usr.first\_name, usr.last\_name |
    |                           ->  Hash  (cost=791.23..791.23 rows=552 width=72) (actual time=16.280..16.281 rows=0 loops=1) |
    |                                 Output: loc.id, loc.country, loc.city |
    |                                 Buckets: 1024  Batches: 1  Memory Usage: 8kB |
    |                                 ->  Bitmap Heap Scan on usr.loc  (cost=104.56..791.23 rows=552 width=72) (actual time=16.279..16.280 rows=0 loops=1) |
    |                                       Output: loc.id, loc.country, loc.city |
    |                                       Filter: ('baker'::text \<% ((loc.country \|\| ' '::text) \|\| loc.city)) |
    |                                       ->  Bitmap Index Scan on loc\_country\_city\_idx  (cost=0.00..104.42 rows=552 width=0) (actual time=16.275..16.275 rows=0 loops=1) |
    |                                             Index Cond: (((loc.country \|\| ' '::text) \|\| loc.city) %> 'baker'::text) |
    |                     ->  Hash Left Join  (cost=2029.53..2762.48 rows=556 width=144) (actual time=17.844..17.845 rows=0 loops=1) |
    |                           Output: usr\_1.id, usr\_1.first\_name, usr\_1.last\_name, loc\_1.id, loc\_1.country, loc\_1.city |
    |                           Inner Unique: true |
    |                           Hash Cond: (usr\_1.loc\_id = loc\_1.id) |
    |                           ->  Bitmap Heap Scan on usr.usr usr\_1  (cost=104.59..836.07 rows=556 width=80) (actual time=17.843..17.843 rows=0 loops=1) |
    |                                 Output: usr\_1.id, usr\_1.loc\_id, usr\_1.first\_name, usr\_1.last\_name |
    |                                 Filter: ('baker'::text \<% ((usr\_1.first\_name \|\| ' '::text) \|\| usr\_1.last\_name)) |
    |                                 ->  Bitmap Index Scan on usr\_first\_name\_last\_name\_idx  (cost=0.00..104.45 rows=556 width=0) (actual time=17.838..17.839 rows=0 loops=1) |
    |                                       Index Cond: (((usr\_1.first\_name \|\| ' '::text) \|\| usr\_1.last\_name) %> 'baker'::text) |
    |                           ->  Hash  (cost=1234.42..1234.42 rows=55242 width=72) (never executed) |
    |                                 Output: loc\_1.id, loc\_1.country, loc\_1.city |
    |                                 ->  Seq Scan on usr.loc loc\_1  (cost=0.00..1234.42 rows=55242 width=72) (never executed) |
    |                                       Output: loc\_1.id, loc\_1.country, loc\_1.city |
    | Planning Time: 0.314 ms |
    | Execution Time: 34.307 ms |