Search code examples
sqlpostgresqlsimilaritypg-trgm

How do I fetch similar posts from database where text length can be more than 2000 characters


As far as I'm aware, there is no simple, quick solution. I am trying to do a full-text keyword or semantic search, which is a very advanced topic. There are dedicated search servers created specifically for that reason, but still is there a way that I can implement for a query execution time for less than a second?

Here's what I have tried so far:

begin;

SET pg_trgm.similarity_threshold = 0.3;

select
    id, <col_name>
    similarity(<column with gin index>,
    '<text to be searched>') as sml
from
    <table> p
where
    <clauses> and
 <indexed_col> % '<text to be searched>'    
 and indexed_col <-> '<text to be searched>' < 0.5
order by
indexed_col <-> '<text to be searched>'
limit 10;
 end;

Index created is as follows: CREATE INDEX trgm_idx ON posts USING gin (post_title_combined gin_trgm_ops);

The above query takes around 6-7 secs to execute and sometimes even 200 ms which is weird to me because it changes the query plan according to the input I pass in for similarity.

I tried ts_vector @@ ts_query, but they turn out to be too strict due to & operator.

EDIT: Here's the EXPLAIN ANALYZE of the above query

  ->  Sort  (cost=463.82..463.84 rows=5 width=321) (actual time=3778.726..3778.728 rows=0 loops=1)
        Sort Key: ((post_title_combined <-> 'Test text not to be disclosed'::text))
        Sort Method: quicksort  Memory: 25kB
        ->  Bitmap Heap Scan on posts p  (cost=404.11..463.77 rows=5 width=321) (actual time=3778.722..3778.723 rows=0 loops=1)
              Recheck Cond: (post_title_combined % 'Test text not to be disclosed'::text)
              Rows Removed by Index Recheck: 36258
              Filter: ((content IS NOT NULL) AND (is_crawlable IS TRUE) AND (score IS NOT NULL) AND (status = 1) AND ((post_title_combined <-> 'Test text not to be disclosed'::text) < '0.5'::double precision))
              Heap Blocks: exact=24043
              ->  Bitmap Index Scan on trgm_idx  (cost=0.00..404.11 rows=15 width=0) (actual time=187.394..187.394 rows=36916 loops=1)
                    Index Cond: (post_title_combined % 'Test text not to be disclosed'::text)
Planning Time: 8.782 ms
Execution Time: 3778.787 ms```

Solution

  • Your redundant/overlapping query conditions aren't helpful. Setting similarity_threshold=0.3 then doing

    t % q and t <-> q < 0.5 
    

    just throws away index selectivity for no reason. Set similarity_threshold to as stringent of a value as you want to use, then get rid of the unnecessary <-> condition.

    You could try the GiST version of trigram indexing. I can support the ORDER BY ... <-> ... LIMIT 10 operation directly from the index. I doubt it will be very effective with 2000 char strings, but it is worth a try.