I am trying to optimize a similarity (word_similarity
) query, which searches multiple columns on 2 tables.
The tables and their indices look like this:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Table definitions
CREATE TABLE usr (
id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
loc_id bigint,
first_name text,
last_name text
);
CREATE TABLE loc (
id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
country text,
city text
);
-- Primary keys
ALTER TABLE ONLY usr
ADD CONSTRAINT usr_pkey PRIMARY KEY (id);
ALTER TABLE ONLY loc
ADD CONSTRAINT loc_pkey PRIMARY KEY (id);
-- Foreign key
ALTER TABLE ONLY usr.loc_id
ADD CONSTRAINT usr_loc_id_fkey FOREIGN KEY (loc_id) REFERENCES loc(id);
-- Indices
CREATE INDEX usr_loc_id_idx ON usr
USING btree (loc_id);
CREATE INDEX usr_first_name_last_name_idx ON usr
USING gist ((first_name || ' ' || last_name) gist_trgm_ops);
CREATE INDEX loc_country_city_idx ON loc
USING gist ((country || ' ' || city) gist_trgm_ops);
-- Insert dummy data
INSERT INTO loc (country, city)
SELECT substr(md5(random()::text), 0, 25), substr(md5(random()::text), 0, 25)
FROM generate_series(1, 100000) AS t;
INSERT INTO usr (first_name, last_name, loc_id)
SELECT substr(md5(random()::text), 0, 25), substr(md5(random()::text), 0, 25), trunc(random() * 99999 + 1)
FROM generate_series(1, 1000000) AS t;
When searching the columns of just one table, everything seems to be optimized fine:
SELECT
usr.id AS usr_id, usr.first_name, usr.last_name,
loc.id AS loc_id, loc.country, loc.city
FROM usr
LEFT JOIN loc ON usr.loc_id = loc.id
WHERE 'baker' <% (usr.first_name || ' ' || usr.last_name)
ORDER BY 'baker' <<-> (usr.first_name || ' ' || usr.last_name)
LIMIT 20;
This produces the below query plan:
Limit (cost=0.70..236.19 rows=20 width=118) (actual time=397.086..596.842 rows=1 loops=1)
-> Nested Loop Left Join (cost=0.70..1178.16 rows=100 width=118) (actual time=397.084..596.839 rows=1 loops=1)
-> Index Scan using usr_first_name_last_name_idx on usr (cost=0.41..422.41 rows=100 width=66) (actual time=397.038..596.792 rows=1 loops=1)
Index Cond: (((first_name || ' '::text) || last_name) %> 'baker'::text)
Order By: (((first_name || ' '::text) || last_name) <->> 'baker'::text)
-> Index Scan using loc_pkey on loc (cost=0.29..7.55 rows=1 width=56) (actual time=0.037..0.037 rows=1 loops=1)
Index Cond: (id = usr.loc_id)
Planning Time: 1.252 ms
Execution Time: 596.931 ms
My attempt to write a query that searches columns on both the usr
and loc
tables is:
SELECT
usr.id AS usr_id, usr.first_name, usr.last_name,
loc.id AS loc_id, loc.country, loc.city
FROM usr
LEFT JOIN loc ON usr.loc_id = loc.id
WHERE
'baker' <% (usr.first_name || ' ' || usr.last_name)
OR
'baker' <% (loc.country || ' ' || loc.city)
ORDER BY
('baker' <<-> (usr.first_name || ' ' || usr.last_name))
+
('baker' <<-> (loc.country || ' ' || loc.city));
And the query plan for this one is:
Gather Merge (cost=21071.24..21090.61 rows=166 width=118) (actual time=6297.322..6301.774 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=20071.22..20071.42 rows=83 width=118) (actual time=6286.205..6286.208 rows=0 loops=3)
Sort Key: ((('baker'::text <<-> ((usr.first_name || ' '::text) || usr.last_name)) + ('baker'::text <<-> ((loc.country || ' '::text) || loc.city))))
Sort Method: quicksort Memory: 25kB
Worker 0: Sort Method: quicksort Memory: 25kB
Worker 1: Sort Method: quicksort Memory: 25kB
-> Parallel Hash Left Join (cost=2460.57..20068.57 rows=83 width=118) (actual time=4197.337..6284.058 rows=0 loops=3)
Hash Cond: (usr.loc_id = loc.id)
Filter: (('baker'::text <% ((usr.first_name || ' '::text) || usr.last_name)) OR ('baker'::text <% ((loc.country || ' '::text) || loc.city)))
Rows Removed by Filter: 333335
-> Parallel Seq Scan on usr (cost=0.00..16512.69 rows=416669 width=66) (actual time=0.078..214.231 rows=333335 loops=3)
-> Parallel Hash (cost=1725.25..1725.25 rows=58825 width=56) (actual time=23.072..23.073 rows=33334 loops=3)
Buckets: 131072 Batches: 1 Memory Usage: 10432kB
-> Parallel Seq Scan on loc (cost=0.00..1725.25 rows=58825 width=56) (actual time=0.007..5.594 rows=33334 loops=3)
Planning Time: 3.470 ms
Execution Time: 6301.859 ms
Is there a way I can improve this query? Or is such a "multi-table similarity" query not possible in Postgres?
Thank you!
It looks like OR
prevented your index from kicking in. You could instead run your select with each of the conditions separately, UNION
that, then apply your desired ORDER
after: demo
EXPLAIN ANALYSE VERBOSE
SELECT * FROM
( SELECT
usr.id AS usr_id, usr.first_name, usr.last_name,
loc.id AS loc_id, loc.country, loc.city
FROM usr
LEFT JOIN loc ON usr.loc_id = loc.id
WHERE 'baker' <% (loc.country || ' ' || loc.city)
UNION
SELECT
usr.id AS usr_id, usr.first_name, usr.last_name,
loc.id AS loc_id, loc.country, loc.city
FROM usr
LEFT JOIN loc ON usr.loc_id = loc.id
WHERE 'baker' <% (usr.first_name || ' ' || usr.last_name)
) ORDER BY
('baker' <<-> (first_name || ' ' || last_name))
+
('baker' <<-> (country || ' ' || city));
| QUERY PLAN |
|:-----------|
| Sort (cost=5112.22..5115.00 rows=1111 width=148) (actual time=34.165..34.171 rows=0 loops=1) |
| Output: \_.usr\_id, \_.first\_name, \_.last\_name, \_.loc\_id, \_.country, \_.city, ((('baker'::text \<\<-> ((\_.first\_name \|\| ' '::text) \|\| \_.last\_name)) + ('baker'::text \<\<-> ((\_.country \|\| ' '::text) \|\| \_.city)))) |
| Sort Key: ((('baker'::text \<\<-> ((\_.first\_name \|\| ' '::text) \|\| \_.last\_name)) + ('baker'::text \<\<-> ((\_.country \|\| ' '::text) \|\| \_.city)))) |
| Sort Method: quicksort Memory: 25kB |
| -> Subquery Scan on \_ (cost=5025.47..5056.02 rows=1111 width=148) (actual time=34.159..34.164 rows=0 loops=1) |
| Output: \_.usr\_id, \_.first\_name, \_.last\_name, \_.loc\_id, \_.country, \_.city, (('baker'::text \<\<-> ((\_.first\_name \|\| ' '::text) \|\| \_.last\_name)) + ('baker'::text \<\<-> ((\_.country \|\| ' '::text) \|\| \_.city))) |
| -> HashAggregate (cost=5025.47..5036.58 rows=1111 width=144) (actual time=34.158..34.163 rows=0 loops=1) |
| Output: usr.id, usr.first\_name, usr.last\_name, loc.id, loc.country, loc.city |
| Group Key: usr.id, usr.first\_name, usr.last\_name, loc.id, loc.country, loc.city |
| Batches: 1 Memory Usage: 73kB |
| -> Append (cost=798.13..5008.80 rows=1111 width=144) (actual time=34.144..34.148 rows=0 loops=1) |
| -> Hash Join (cost=798.13..2240.77 rows=555 width=144) (actual time=16.292..16.294 rows=0 loops=1) |
| Output: usr.id, usr.first\_name, usr.last\_name, loc.id, loc.country, loc.city |
| Inner Unique: true |
| Hash Cond: (usr.loc\_id = loc.id) |
| -> Seq Scan on usr.usr (cost=0.00..1296.75 rows=55575 width=80) (actual time=0.007..0.007 rows=1 loops=1) |
| Output: usr.id, usr.loc\_id, usr.first\_name, usr.last\_name |
| -> Hash (cost=791.23..791.23 rows=552 width=72) (actual time=16.280..16.281 rows=0 loops=1) |
| Output: loc.id, loc.country, loc.city |
| Buckets: 1024 Batches: 1 Memory Usage: 8kB |
| -> Bitmap Heap Scan on usr.loc (cost=104.56..791.23 rows=552 width=72) (actual time=16.279..16.280 rows=0 loops=1) |
| Output: loc.id, loc.country, loc.city |
| Filter: ('baker'::text \<% ((loc.country \|\| ' '::text) \|\| loc.city)) |
| -> Bitmap Index Scan on loc\_country\_city\_idx (cost=0.00..104.42 rows=552 width=0) (actual time=16.275..16.275 rows=0 loops=1) |
| Index Cond: (((loc.country \|\| ' '::text) \|\| loc.city) %> 'baker'::text) |
| -> Hash Left Join (cost=2029.53..2762.48 rows=556 width=144) (actual time=17.844..17.845 rows=0 loops=1) |
| Output: usr\_1.id, usr\_1.first\_name, usr\_1.last\_name, loc\_1.id, loc\_1.country, loc\_1.city |
| Inner Unique: true |
| Hash Cond: (usr\_1.loc\_id = loc\_1.id) |
| -> Bitmap Heap Scan on usr.usr usr\_1 (cost=104.59..836.07 rows=556 width=80) (actual time=17.843..17.843 rows=0 loops=1) |
| Output: usr\_1.id, usr\_1.loc\_id, usr\_1.first\_name, usr\_1.last\_name |
| Filter: ('baker'::text \<% ((usr\_1.first\_name \|\| ' '::text) \|\| usr\_1.last\_name)) |
| -> Bitmap Index Scan on usr\_first\_name\_last\_name\_idx (cost=0.00..104.45 rows=556 width=0) (actual time=17.838..17.839 rows=0 loops=1) |
| Index Cond: (((usr\_1.first\_name \|\| ' '::text) \|\| usr\_1.last\_name) %> 'baker'::text) |
| -> Hash (cost=1234.42..1234.42 rows=55242 width=72) (never executed) |
| Output: loc\_1.id, loc\_1.country, loc\_1.city |
| -> Seq Scan on usr.loc loc\_1 (cost=0.00..1234.42 rows=55242 width=72) (never executed) |
| Output: loc\_1.id, loc\_1.country, loc\_1.city |
| Planning Time: 0.314 ms |
| Execution Time: 34.307 ms |