Running a local Docker instance of Postgres 12.5 (with 4MB work_mem). I'm implementing this pattern to search arbitrary fields in JSON. The goal is to search and return the JSON column profile quickly.
I tried an EXISTS subquery on a materialized view:
CREATE TABLE end_user (
id varchar NOT NULL,
environment_id varchar NOT NULL,
profile jsonb NOT NULL DEFAULT '{}'::jsonb,
CONSTRAINT end_user_pkey PRIMARY KEY (environment_id, id)
);
CREATE INDEX end_user_environment_id_idx ON private.end_user USING btree (environment_id);
CREATE INDEX end_user_id_idx ON private.end_user USING btree (id);
CREATE INDEX end_user_profile_idx ON private.end_user USING gin (profile);
CREATE MATERIALIZED VIEW user_profiles AS
SELECT u.environment_id, u.id, j.key, j.value
FROM end_user u, jsonb_each_text(u.profile) j(key, value);
CREATE UNIQUE INDEX on user_profiles (environment_id, id, key);
CREATE INDEX user_profile_trgm_idx ON user_profiles using gin (value gin_trgm_ops);
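One operational note on the materialized view: it has to be refreshed to pick up changes to end_user, and the unique index on (environment_id, id, key) created above is what makes a non-blocking refresh possible:

```sql
-- Refresh without locking out concurrent readers;
-- CONCURRENTLY requires at least one unique index on the view.
REFRESH MATERIALIZED VIEW CONCURRENTLY user_profiles;
```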
I have this query which is indexed correctly so that it executes in a few milliseconds over a million rows. ✅
select * from user_profiles
where value ilike '%auckland%' and key = 'timezone' and environment_id = 'test';
Exec time 42ms 😘
Bitmap Heap Scan on user_profiles (cost=28935.65..62591.44 rows=9659 width=65)
Recheck Cond: ((value ~~* '%auckland%'::text) AND (key = 'timezone'::text))
Filter: ((environment_id)::text = 'test'::text)
-> BitmapAnd (cost=28935.65..28935.65 rows=9659 width=0)
-> Bitmap Index Scan on user_profile_trgm_idx (cost=0.00..2923.95 rows=320526 width=0)
Index Cond: (value ~~* '%auckland%'::text)
-> Bitmap Index Scan on user_profiles_key_idx (cost=0.00..26006.62 rows=994408 width=0)
Index Cond: (key = 'timezone'::text)
However, if I use it with an exists query in order to build up conditions like so:
select * from end_user u
where
environment_id = 'test'
and exists (
select 1 from user_profiles p
where
value ilike '%auckland%'
and key = 'timezone'
and p.id = u.id
and environment_id = 'test'
)
It performs very slowly.
Exec time 17.44 Seconds 💀
Nested Loop (cost=62616.01..124606.45 rows=9658 width=1459) (actual time=19206.818..28444.491 rows=332572 loops=1)
Buffers: shared hit=952734 read=624101
-> HashAggregate (cost=62615.59..62707.52 rows=9193 width=15) (actual time=19205.238..19292.998 rows=332572 loops=1)
Group Key: (p.id)::text
Buffers: shared hit=373 read=246174
-> Bitmap Heap Scan on user_profiles p (cost=28935.65..62591.44 rows=9659 width=15) (actual time=278.211..18942.629 rows=332572 loops=1)
Recheck Cond: ((value ~~* '%auckland%'::text) AND (key = 'timezone'::text))
Rows Removed by Index Recheck: 17781109
Filter: ((environment_id)::text = 'test'::text)
Heap Blocks: exact=43928 lossy=197955
Buffers: shared hit=373 read=246174
-> BitmapAnd (cost=28935.65..28935.65 rows=9659 width=0) (actual time=272.626..272.629 rows=0 loops=1)
Buffers: shared hit=373 read=4291
-> Bitmap Index Scan on user_profile_trgm_idx (cost=0.00..2923.95 rows=320526 width=0) (actual time=177.577..177.577 rows=332572 loops=1)
Index Cond: (value ~~* '%auckland%'::text)
Buffers: shared hit=373 read=455
-> Bitmap Index Scan on user_profiles_key_idx (cost=0.00..26006.62 rows=994408 width=0) (actual time=92.586..92.589 rows=1000000 loops=1)
Index Cond: (key = 'timezone'::text)
Buffers: shared read=3836
-> Index Scan using end_user_id_idx on end_user u (cost=0.42..6.79 rows=1 width=1459) (actual time=0.027..0.027 rows=1 loops=332572)
Index Cond: ((id)::text = (p.id)::text)
Filter: ((environment_id)::text = 'test'::text)
Buffers: shared hit=952361 read=377927
Planning Time: 19.002 ms
Execution Time: 28497.570 ms
This is a shame, as exists would be handy if it were fast: I could build up conditions dynamically in my application code, with each extra condition represented as an additional exists clause.
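For example, a second dynamically added condition would become its own exists clause (the last_name / 'jones' values here are made up for illustration):

```sql
select * from end_user u
where environment_id = 'test'
  and exists (
    select 1 from user_profiles p
    where p.value ilike '%auckland%'
      and p.key = 'timezone'
      and p.environment_id = u.environment_id
      and p.id = u.id
  )
  and exists (
    select 1 from user_profiles p
    where p.value ilike '%jones%'        -- hypothetical extra condition
      and p.key = 'last_name'
      and p.environment_id = u.environment_id
      and p.id = u.id
  );
```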
By the way, a lateral join does speed this up, but I don't understand why there is such a big difference:
select * from end_user u,
lateral (
select id from user_profiles p
where
value ilike '%auckland%'
and key = 'timezone'
and environment_id = u.environment_id
and p.id = u.id
) ss
where u.environment_id = 'test';
Exec time 304ms 👍
Gather (cost=29936.07..91577.38 rows=9658 width=1474) (actual time=1100.824..15430.620 rows=332572 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=1140551 read=436286
-> Nested Loop (cost=28936.07..89611.58 rows=4024 width=1474) (actual time=602.490..14805.285 rows=110857 loops=3)
Buffers: shared hit=1140551 read=436286
-> Parallel Bitmap Heap Scan on user_profiles p (cost=28935.65..62492.84 rows=4025 width=22) (actual time=602.078..12247.891 rows=110857 loops=3)
Recheck Cond: ((value ~~* '%auckland%'::text) AND (key = 'timezone'::text))
Rows Removed by Index Recheck: 5927036
Filter: ((environment_id)::text = 'test'::text)
Heap Blocks: exact=14659 lossy=65588
Buffers: shared hit=373 read=246174
-> BitmapAnd (cost=28935.65..28935.65 rows=9659 width=0) (actual time=1087.258..1087.259 rows=0 loops=1)
Buffers: shared hit=373 read=4291
-> Bitmap Index Scan on user_profile_trgm_idx (cost=0.00..2923.95 rows=320526 width=0) (actual time=853.075..853.076 rows=332572 loops=1)
Index Cond: (value ~~* '%auckland%'::text)
Buffers: shared hit=373 read=455
-> Bitmap Index Scan on user_profiles_key_idx (cost=0.00..26006.62 rows=994408 width=0) (actual time=231.295..231.295 rows=1000000 loops=1)
Index Cond: (key = 'timezone'::text)
Buffers: shared read=3836
-> Index Scan using end_user_id_idx on end_user u (cost=0.42..6.74 rows=1 width=1459) (actual time=0.022..0.022 rows=1 loops=332572)
Index Cond: ((id)::text = (p.id)::text)
Filter: ((environment_id)::text = 'test'::text)
Buffers: shared hit=1140178 read=190112
Planning Time: 16.877 ms
Execution Time: 15461.571 ms
Keen to hear any thoughts on why the exists subquery is so slow, and any other options I could look at here.
Distinct counts, as requested by Erwin. Note this is dummy data to test load, but it's reasonably close to production ratios:
select count(distinct environment_id) => 4
, count(distinct key) => 33
, count(distinct value) => 15M
from private.user_profiles;
Update after increasing work_mem to 16MB, as suggested by Erwin:
ALTER SYSTEM SET work_mem to '16MB';
SELECT pg_reload_conf();
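To confirm the new value is active (check from a fresh session, since work_mem is resolved per session):

```sql
SHOW work_mem;  -- should report 16MB
SELECT name, setting, unit FROM pg_settings WHERE name = 'work_mem';
```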
Things are looking better, with a 500ms execution time for the exists query. The EXPLAIN output now looks like:
Gather (cost=3926.79..400754.43 rows=9658 width=1459) (actual time=312.213..9396.610 rows=332572 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=1141083 read=431918
-> Nested Loop (cost=2926.79..398788.63 rows=4024 width=1459) (actual time=155.271..8987.721 rows=110857 loops=3)
Buffers: shared hit=1141083 read=431918
-> Parallel Bitmap Heap Scan on user_profiles p (cost=2926.36..371669.88 rows=4025 width=15) (actual time=150.989..2962.870 rows=110857 loops=
Recheck Cond: (value ~~* '%auckland%'::text)
Filter: (((environment_id)::text = 'test'::text) AND (key = 'timezone'::text))
Heap Blocks: exact=82556
Buffers: shared hit=981 read=241730
-> Bitmap Index Scan on user_profile_trgm_idx (cost=0.00..2923.95 rows=320526 width=0) (actual time=243.604..243.605 rows=332572 loops=1
Index Cond: (value ~~* '%auckland%'::text)
Buffers: shared hit=828
-> Index Scan using end_user_id_idx on end_user u (cost=0.42..6.74 rows=1 width=1459) (actual time=0.054..0.054 rows=1 loops=332572)
Index Cond: ((id)::text = (p.id)::text)
Filter: ((environment_id)::text = 'test'::text)
Buffers: shared hit=1140102 read=190188
Planning Time: 9.932 ms
Execution Time: 9427.067 ms
My answer you have been working off is dated: the current Postgres version was 9.4 when it was written, in July 2015.
In Postgres 12 the whole design can be radically simpler, using regular expressions in an SQL/JSON path expression. The manual:
SQL/JSON path expressions allow matching text to a regular expression with the like_regex filter.
With index support, too. Scrap the materialized view. All we need is your original table and an index like:
CREATE INDEX end_user_path_ops_idx ON end_user USING GIN (profile jsonb_path_ops);
This query is equivalent to your original and can use the index:
SELECT *
FROM end_user u
WHERE environment_id = 'test'
AND profile @? '$.timezone ? (@ like_regex "auck" flag "i")';
db<>fiddle here
The one drawback is that the SQL/JSON path language takes some getting used to.
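As a bonus, this approach also covers the dynamic-conditions use case from the question: each extra condition is just another @? predicate on profile, and each can use the same GIN index. A sketch, assuming a hypothetical last_name field in the profile:

```sql
SELECT *
FROM end_user u
WHERE environment_id = 'test'
  AND profile @? '$.timezone  ? (@ like_regex "auck"  flag "i")'
  AND profile @? '$.last_name ? (@ like_regex "jones" flag "i")';
```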
A more basic problem becomes obvious in this line of the EXPLAIN output:
Heap Blocks: exact=14659 lossy=65588
lossy indicates that you don't have enough work_mem. Your setting is obviously very low. (The default setting of 4 MB is way too low for a database involving tables with millions of rows.)
Chances are, you need to do more in the department of server configuration, and you seem to be limited on RAM in general. I see high "read" counts, which indicates a cold cache and/or insufficient or misconfigured cache memory.
This Postgres Wiki page can help you get started.