I have a TEXT keyvalues
column in Postgres:
select * from test5 limit 5;
id | keyvalues
----+------------------------------------------------------
1 | ^ first 1 | second 3
2 | ^ first 1 | second 2 ^ first 2 | second 3
3 | ^ first 1 | second 2 | second 3
4 | ^ first 2 | second 3 ^ first 1 | second 2 | second 2
5 | ^ first 2 | second 3 ^ first 1 | second 3
My queries must exclude the ^
character from the middle of the match, so I'm using regular expressions:
explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=78383.31..78383.32 rows=1 width=8) (actual time=7332.030..7332.030 rows=1 loops=1)
-> Gather (cost=78383.10..78383.30 rows=2 width=8) (actual time=7332.021..7337.138 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=77383.10..77383.10 rows=1 width=8) (actual time=7328.155..7328.156 rows=1 loops=3)
-> Parallel Seq Scan on test5 (cost=0.00..77382.50 rows=238 width=0) (actual time=7328.146..7328.146 rows=0 loops=3)
Filter: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
Rows Removed by Filter: 1666668
Planning Time: 0.068 ms
Execution Time: 7337.184 ms
The query works (zero rows match), but is way too slow at > 7 seconds.
I thought indexing with trigrams would help, but no luck:
create extension if not exists pg_trgm;
create index on test5 using gin (keyvalues gin_trgm_ops);
explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1484.02..1484.03 rows=1 width=8) (actual time=23734.646..23734.646 rows=1 loops=1)
-> Bitmap Heap Scan on test5 (cost=1480.00..1484.01 rows=1 width=0) (actual time=23734.641..23734.641 rows=0 loops=1)
Recheck Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
Rows Removed by Index Recheck: 5000005
Heap Blocks: exact=47620
-> Bitmap Index Scan on test5_keyvalues_idx (cost=0.00..1480.00 rows=1 width=0) (actual time=1756.158..1756.158 rows=5000005 loops=1)
Index Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
Planning Time: 0.412 ms
Execution Time: 23734.722 ms
The query with the trigram index is 3x slower! It still returns the correct result (zero rows). I expected the trigram index to figure out immediately there's no second 0
string anywhere, and be super fast.
(Motivation: I want to avoid normalizing the keyvalues
into another table, so I'm looking to encode the matching logic in a single TEXT
field using text indexing and regexps instead. The logic works, but is too slow, as is JSONB.)
According to the OP, the correct answer was given here on DBA.SE by user @jjanes:
I expected the trigram index to figure out immediately there's no
second 0
string anywhere'second' and '0' are separate words, so it cannot detect their joint absence as such. It seems like it could detect the absence of ' 0', but this comment from "contrib/pg_trgm/trgm_regexp.c" seems pertinent:
* Note: Using again the example "foo bar", we will not consider the * trigram " b", though this trigram would be found by the trigram * extraction code. Since we will find " ba", it doesn't seem worth * trying to hack the algorithm to generate the additional trigram.
Since 0 is the last character in the pattern string, there will be no trigram of the form " 0a", either, so it just misses that opportunity.
Even if it were not for this limitation, your approach seems extremely fragile.