python, database, postgresql, database-performance

Fastest way to update a few million JSON values in postgres


I have a Postgres table User. User has many attributes, including preferences, a JSONB column holding a dictionary of settings.

I have a CSV file with a few million user IDs, and for each of them the dictionary needs to be updated with the following query:

        UPDATE "User"
        SET preferences = jsonb_set(
            jsonb_set(
                preferences,
                '{setting1}',
                'true'::jsonb,
                true
            ),
            '{setting2}',
            'true'::jsonb,
            true
        )
        WHERE id = '{my-user-id}';

This basically just updates 2 fields within the dictionary in a single query.

My problem is that in testing, running this sequentially from a Python script yielded only a few hundred updates per second, so the full run would take many hours to complete.
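
For reference, the sequential approach looks roughly like this (a minimal sketch, assuming psycopg2; the file name and connection string are illustrative):

        import csv

        import psycopg2  # assumed driver; adapt to whatever client you use

        conn = psycopg2.connect("dbname=mydb")  # illustrative connection string
        cur = conn.cursor()

        with open("user_ids.csv") as f:  # hypothetical CSV: one user ID per line
            for row in csv.reader(f):
                user_id = row[0]
                # One statement and one round trip per user ID
                cur.execute(
                    """
                    UPDATE "User"
                    SET preferences = jsonb_set(
                        jsonb_set(preferences, '{setting1}', 'true'::jsonb, true),
                        '{setting2}', 'true'::jsonb, true
                    )
                    WHERE id = %s
                    """,
                    (user_id,),
                )
        conn.commit()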

How can I speed this up?


Solution

  • I assume you have it, but asking just in case - the UPDATE in a loop requires an index on the id column; without it, each iteration performs a sequential scan, which leads to catastrophic performance.

    Anyway, a far better approach (as commented) is to use a single UPDATE.

    I'd recommend creating a helper table containing all IDs that should be updated (I call it upd) and then using a single UPDATE as follows (a sketch for loading the CSV into this table follows below):

        UPDATE test
        SET preferences = jsonb_set(
            jsonb_set(
                preferences,
                '{setting1}',
                'true'::jsonb,
                true
            ),
            '{setting2}',
            'true'::jsonb,
            true
        )
        WHERE id IN (SELECT id FROM upd);
    

    This will trigger a hash join that makes the update efficient - my test shows an update of 1M rows in a bit over 6 seconds.
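
    In your case the upd table can be bulk-loaded straight from the CSV before running the single UPDATE. A minimal sketch with psycopg2 (the file name, connection string, and id type are assumptions - adjust them to your schema):

        import psycopg2  # assumed driver

        conn = psycopg2.connect("dbname=mydb")  # illustrative connection string
        cur = conn.cursor()

        # Throwaway helper table with the IDs to update; adjust the type to match "User".id
        cur.execute("CREATE UNLOGGED TABLE upd (id text PRIMARY KEY);")

        # Bulk-load the CSV with COPY - far faster than inserting row by row
        with open("user_ids.csv") as f:  # hypothetical CSV: one user ID per line
            cur.copy_expert("COPY upd (id) FROM STDIN WITH (FORMAT csv)", f)

        # Single set-based UPDATE, as above
        cur.execute("""
            UPDATE "User"
            SET preferences = jsonb_set(
                jsonb_set(preferences, '{setting1}', 'true'::jsonb, true),
                '{setting2}', 'true'::jsonb, true
            )
            WHERE id IN (SELECT id FROM upd);
        """)
        conn.commit()

    The helper table can simply be dropped once the update has run.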

    Test

        create table test as
        with dt as (
        select id, 'x' x, 'y' y, 'z' z
        from generate_series(1,1000000) t(id)
        )
        select id, to_jsonb(row(id, x, y, z)) preferences from dt;

        create table upd as
        select id
        from generate_series(1,1000000) t(id);
    

    Query Plan

        Update on test  (cost=18534.19..59754.97 rows=0 width=0) (actual time=6113.324..6113.329 rows=0 loops=1)
          Buffers: shared hit=3034860 read=11699 dirtied=27083 written=15385, temp read=22408 written=25430
          ->  Hash Join  (cost=18534.19..59754.97 rows=721614 width=44) (actual time=1240.077..4040.943 rows=1000000 loops=1)
                Hash Cond: (test.id = upd.id)
                Buffers: shared hit=4096 read=11693 dirtied=11693, temp read=22408 written=25430
                ->  Seq Scan on test  (cost=0.00..25796.28 rows=1443228 width=42) (actual time=0.048..200.050 rows=1000000 loops=1)
                      Buffers: shared hit=2048 read=9316 dirtied=9316
                ->  Hash  (cost=18531.69..18531.69 rows=200 width=10) (actual time=1239.799..1239.801 rows=1000000 loops=1)
                      Buckets: 131072 (originally 1024)  Batches: 16 (originally 1)  Memory Usage: 3716kB
                      Buffers: shared hit=2048 read=2377 dirtied=2377, temp read=7442 written=14119
                      ->  HashAggregate  (cost=18529.69..18531.69 rows=200 width=10) (actual time=432.261..1051.496 rows=1000000 loops=1)
                            Group Key: upd.id
                            Batches: 101  Memory Usage: 5481kB  Disk Usage: 32064kB
                            Buffers: shared hit=2048 read=2377 dirtied=2377, temp read=7442 written=10464
                            ->  Seq Scan on upd  (cost=0.00..15708.75 rows=1128375 width=10) (actual time=0.018..147.495 rows=1000000 loops=1)
                                  Buffers: shared hit=2048 read=2377 dirtied=2377
        Planning:
          Buffers: shared hit=10
        Planning Time: 0.146 ms
        Execution Time: 6120.546 ms