Search code examples
postgresqlindexingpg-trgm

Is there a way to use pg_trgm like operator with btree indexes on PostgreSQL?


I have two tables:

  • table_1 with ~1 million lines, with columns id_t1: integer, c1_t1: varchar, etc.
  • table_2 with ~50 million lines, with columns id_t2: integer, ref_id_t1: integer, c1_t2: varchar, etc.

ref_id_t1 is filled with id_t1 values , however they are not linked by a foreign key as table_2 doesn't know about table_1.

I need to do a request on both table like the following:

SELECT * FROM table_1 t1 WHERE t1.c1_t1= 'A' AND t1.id_t1 IN
(SELECT t2.ref_id_t1 FROM table_2 t2 WHERE t2.c1_t2 LIKE '%abc%');

Without any change or with basic indexes the request takes about a minute to complete as a sequencial scan is peformed on table_2. To prevent this I created a GIN idex with gin_trgm_ops option:

CREATE EXTENSION pg_trgm;
CREATE INDEX c1_t2_gin_index ON table_2 USING gin (c1_t2, gin_trgm_ops);

However this does not solve the problem as the inner request still takes a very long time.

EXPLAIN ANALYSE SELECT t2.ref_id_t1 FROM table_2 t2 WHERE t2.c1_t2 LIKE '%abc%'

Gives the following

Bitmap Heap Scan on table_2 t2 (cost=664.20..189671.00 rows=65058 width=4) (actual time=5101.286..22854.838 rows=69631 loops=1)
  Recheck Cond: ((c1_t2 )::text ~~ '%1.1%'::text)
  Rows Removed by Index Recheck: 49069703
  Heap Blocks: exact=611548
  ->  Bitmap Index Scan on gin_trg  (cost=0.00..647.94 rows=65058 width=0) (actual time=4911.125..4911.125 rows=49139334 loops=1)
        Index Cond: ((c1_t2)::text ~~ '%1.1%'::text)
Planning time: 0.529 ms
Execution time: 22863.017 ms

The Bitmap Index Scan is fast, but as we need t2.ref_id_t1 PostgreSQL needs to perform an Bitmap Heap Scan which is not quick on 65000 lines of data.

The solution to avoid the Bitmap Heap Scan would be to perform an Index Only Scan. This is possible using multiple column with btree indexes, see https://www.postgresql.org/docs/9.6/static/indexes-index-only-scans.html

If I change the request like to search the begining of c1_t2, even with the inner request returning 90000 lines, and if I create a btree index on c1_t2 and ref_id_t1 the request takes just over a second.

CREATE INDEX c1_t2_ref_id_t1_index
    ON table_2  USING btree
    (c1_t2 varchar_pattern_ops ASC NULLS LAST, ref_id_t1 ASC NULLS LAST)


EXPLAIN ANALYSE SELECT * FROM table_1 t1 WHERE t1.c1_t1= 'A' AND t1.id_t1 IN
    (SELECT t2.ref_id_t1 FROM table_2 t2 WHERE t2.c1_t2 LIKE 'aaa%');

Hash Join  (cost=56561.99..105233.96 rows=1 width=2522) (actual time=953.647..1068.488 rows=36 loops=1)
  Hash Cond: (t1.id_t1 = t2.ref_id_t1)
  ->  Seq Scan on table_1 t1  (cost=0.00..48669.65 rows=615 width=2522) (actual time=0.088..667.576 rows=790 loops=1)
        Filter: (c1_t1 = 'A')
        Rows Removed by Filter: 1083798
  ->  Hash  (cost=56553.74..56553.74 rows=660 width=4) (actual time=400.657..400.657 rows=69632 loops=1)
        Buckets: 131072 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 3472kB
        ->  HashAggregate  (cost=56547.14..56553.74 rows=660 width=4) (actual time=380.280..391.871 rows=69632 loops=1)
              Group Key: t2.ref_id_t1
              ->  Index Only Scan using c1_t2_ref_id_t1_index on table_2 t2   (cost=0.56..53907.28 rows=1055943 width=4) (actual time=0.014..202.034 rows=974737 loops=1)
                    Index Cond: ((c1_t2  ~>=~ 'aaa'::text) AND (c1_t2  ~<~ 'chb'::text))
                    Filter: ((c1_t2 )::text ~~ 'aaa%'::text)
                    Heap Fetches: 0
Planning time: 1.512 ms
Execution time: 1069.712 ms

However this is not possible with gin indexes, as these indexes don't store all data in the key.

Is there a way to use pg_trmg like extension with btree index so we can have index only scan with LIKE '%abc%' requests?


Solution

  • The problem was apparently caused by a problem with the planner for this particular request '%1.1%'. This request is always very low when the number of results from table_1 is too high, however, so I changed the data structure and integrated table_2 in table_1 in a jsonb column, which gives OK results in all cases.