Search code examples
postgresqlquery-optimization

How to Optimize query of substring searching in Postgres?


I have 2 tables both as the same form

table A : content | text

table B : SubContent | text

Table B containing substring Of Texts in Table A

for exampale:

Table a :

  • zupta . . for making south africa worse than ever .
  • zur eda terkenang arwah ibu . . eda terkenang arwah ibu . .

Table B :

  • making south africa
  • terkenang arwah ibu

what i want to achive is :

making south africa | zupta . . for **making south africa** worse than ever.
terkenang arwah ibu | zur eda **terkenang arwah ibu** . . eda terkenang arwah ibu . .

the query I used is very simple

select subcontent, content from A join B on  content SIMILAR TO '%'||subcontent||'%'

the query explain :

Gather  (cost=1000.00..6611032.19 rows=1678309 width=180)
  Workers Planned: 2
  ->  Nested Loop  (cost=0.00..6442201.29 rows=699295 width=180)
"        Join Filter: (A.content ~ similar_escape((('%'::text || B.subcontent) || '%'::text), NULL::text))"
        ->  Parallel Seq Scan on A  (cost=0.00..8675.79 rows=186479 width=91)
        ->  Seq Scan on B  (cost=0.00..19.50 rows=750 width=89)

i try use Large Text Indexes follow by this blog (Andrew Kane done a great job! )

https://medium.com/@ankane/large-text-indexes-in-postgres-5d7f1677f89f

i try to add an index on both of the columns that didn't help the execution takes 40 minutes, any ideas?

explain after using Laurenz Albe answer

Nested Loop  (cost=27.79..100352.93 rows=1678309 width=180)
  ->  Seq Scan on q2_distinct  (cost=0.00..19.50 rows=750 width=89)
  ->  Bitmap Heap Scan on clean_distinct  (cost=27.79..111.40 rows=2238 width=91)
"        Recheck Cond: (A.content ~ similar_escape((('%'::text || q2_distinct.part) || '%'::text), NULL::text))"
        ->  Bitmap Index Scan on "clean_distinct_Post content_idx"  (cost=0.00..27.23 rows=2238 width=0)
"              Index Cond: (A.content ~ similar_escape((('%'::text || B.content) || '%'::text), NULL::text))"

Solution

  • Any join that does not have a join condition with an equality operator (=) can only use a nested loop join.

    The only index that can potentially help here is a trigram index on A:

    CREATE EXTENSION pg_trgm;
    CREATE INDEX ON "A" USING gin (content gin_trgm_ops);
    

    But you shouldn't expect to get good execution times with a query like that.