postgresql performance database-performance query-performance pg-trgm

Postgres `gin_trgm_ops` index not being used

I'm trying to speed up some text matching in Postgres, using the pg_trgm extensions:

CREATE TABLE test3 (id bigint, key text, value text);

insert into test3 values (1, 'first 1', 'second 3');
insert into test3 values (2, 'first 1', 'second 2');
insert into test3 values (2, 'first 2', 'second 3');
insert into test3 values (3, 'first 1', 'second 2');
insert into test3 values (3, 'first 1', 'second 3');
insert into test3 values (4, 'first 2', 'second 3');
insert into test3 values (4, 'first 2', 'second 3');
insert into test3 values (4, 'first 1', 'second 2');
insert into test3 values (4, 'first 1', 'second 2');

-- repeat the above 1,000,000x times, to have more rows for benchmarking
insert into test3(id, key, value) select id, key, value from test3 cross join generate_series(1, 1000000);

Now I query this table with ILIKE:

select count(*) from test3 where key = 'first 1' and value ilike '%nd 3%';
Time: 918.265 ms

To see if indexing would speed this up, I added pg_trgm on both key and value columns:

CREATE extension if not exists pg_trgm;
CREATE INDEX test3_key_trgm_idx ON test3 USING gin (key gin_trgm_ops);
CREATE INDEX test3_value_trgm_idx ON test3 USING gin (value gin_trgm_ops);

But the query still takes the same time, and EXPLAIN ANALYZE shows the indexes are not being used at all:

explain analyze select count(*) from test3 where key = 'first 1' and value ilike '%nd 3%';
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=126905.14..126905.15 rows=1 width=8) (actual time=1017.666..1017.667 rows=1 loops=1)
   ->  Gather  (cost=126904.93..126905.14 rows=2 width=8) (actual time=1017.505..1018.778 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=125904.93..125904.94 rows=1 width=8) (actual time=1010.862..1010.862 rows=1 loops=3)
               ->  Parallel Seq Scan on test3  (cost=0.00..122427.06 rows=1391148 width=0) (actual time=0.041..973.550 rows=666667 loops=3)
                     Filter: ((value ~~* '%nd 3%'::text) AND (key = 'first 1'::text))
                     Rows Removed by Filter: 2333336
 Planning Time: 0.266 ms
 Execution Time: 1018.814 ms

Time: 1049.413 ms (00:01.049)

Note the sequential scan. What gives?

Solution

Never mind, I found the issue.

The query planner was smarter than my toy test set; seeing as most rows match the query, it went for a sequential scan.

If I try with ilike '%nd 0%' instead, no rows match and EXPLAIN ANALYZE reports Bitmap Index Scan on test3_value_trgm_idx correctly.

So, normalizing the original JSONB in this manner works. But I'll also try to find and compare another way, using regular expressions over TEXT, to avoid having to create and maintain another table.