I have a Postgres table User. User has many attributes, including preferences, a JSONB dictionary.
I have a CSV file with a few million user IDs, and each of those users needs the dictionary updated with the following query:
UPDATE "User"
SET preferences = jsonb_set(
    jsonb_set(
        preferences,
        '{setting1}',
        'true'::jsonb,
        true
    ),
    '{setting2}',
    'true'::jsonb,
    true
)
WHERE id = '{my-user-id}';
This just updates two fields within the dictionary in a single query.
My problem is that in testing, running this sequentially from a Python script yielded only a few hundred updates per second, so the full run would take many hours to complete.
How can I speed this up?
I assume you have one, but asking just in case: the UPDATE in a loop requires an index on the id column. Without it, every iteration runs a sequential scan, which leads to catastrophic performance.
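If id is the primary key you already have one; otherwise, a minimal sketch (assuming id is a text column, as the quoted literal in the question suggests):

CREATE UNIQUE INDEX ON "User" (id);  -- lets each WHERE id = ... lookup use an index scan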
Anyway, a far better approach (as commented) is to use a single UPDATE. I'd recommend creating a helper table containing all the ids that should be updated (I call it upd) and then using a single UPDATE as follows:
UPDATE test
SET preferences = jsonb_set(
    jsonb_set(
        preferences,
        '{setting1}',
        'true'::jsonb,
        true
    ),
    '{setting2}',
    'true'::jsonb,
    true
)
WHERE id IN (SELECT id FROM upd);
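In your case upd would be loaded from the CSV file, and COPY (or psql's \copy) is by far the fastest way to get a few million ids in. A sketch, assuming a headerless one-column CSV named user_ids.csv (the file name is illustrative; match the column type to "User".id):

-- helper table with the ids to update; use the same type as "User".id
CREATE TABLE upd (id text);
-- bulk-load the ids from the client-side CSV file
\copy upd (id) FROM 'user_ids.csv' WITH (FORMAT csv)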
This triggers a hash join, which makes the update efficient - my test shows an update of 1M rows in just over 6 seconds.
Test
create table test as
with dt as (
    select id, 'x' x, 'y' y, 'z' z
    from generate_series(1,1000000) t(id)
)
select id, to_jsonb(row(id, x, y, z)) preferences from dt;

create table upd as
select id
from generate_series(1,1000000) t(id);
Query Plan
Update on test  (cost=18534.19..59754.97 rows=0 width=0) (actual time=6113.324..6113.329 rows=0 loops=1)
  Buffers: shared hit=3034860 read=11699 dirtied=27083 written=15385, temp read=22408 written=25430
  ->  Hash Join  (cost=18534.19..59754.97 rows=721614 width=44) (actual time=1240.077..4040.943 rows=1000000 loops=1)
        Hash Cond: (test.id = upd.id)
        Buffers: shared hit=4096 read=11693 dirtied=11693, temp read=22408 written=25430
        ->  Seq Scan on test  (cost=0.00..25796.28 rows=1443228 width=42) (actual time=0.048..200.050 rows=1000000 loops=1)
              Buffers: shared hit=2048 read=9316 dirtied=9316
        ->  Hash  (cost=18531.69..18531.69 rows=200 width=10) (actual time=1239.799..1239.801 rows=1000000 loops=1)
              Buckets: 131072 (originally 1024)  Batches: 16 (originally 1)  Memory Usage: 3716kB
              Buffers: shared hit=2048 read=2377 dirtied=2377, temp read=7442 written=14119
              ->  HashAggregate  (cost=18529.69..18531.69 rows=200 width=10) (actual time=432.261..1051.496 rows=1000000 loops=1)
                    Group Key: upd.id
                    Batches: 101  Memory Usage: 5481kB  Disk Usage: 32064kB
                    Buffers: shared hit=2048 read=2377 dirtied=2377, temp read=7442 written=10464
                    ->  Seq Scan on upd  (cost=0.00..15708.75 rows=1128375 width=10) (actual time=0.018..147.495 rows=1000000 loops=1)
                          Buffers: shared hit=2048 read=2377 dirtied=2377
Planning:
  Buffers: shared hit=10
Planning Time: 0.146 ms
Execution Time: 6120.546 ms
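A side note on the plan: the Hash and HashAggregate nodes were estimated at rows=200 (the planner's default distinct-value guess for a table without statistics) against an actual 1000000. Analyzing the helper table right after loading it gives the planner real numbers and can only help:

ANALYZE upd;  -- collect statistics so the planner sees the true row count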