Tags: sql, postgresql, performance, query-optimization

Upsert vs case when for table update


Let's say I want to modify multiple rows of a table with different values. I want to know which approach is faster and which is better in general: doing an upsert on the table, or modifying it with a single UPDATE containing multiple CASE WHEN expressions.

create table example (
  id integer NOT NULL,
  col1 varchar(255) NOT NULL,
  col2 varchar(255) NOT NULL,
  CONSTRAINT example_pkey PRIMARY KEY (id)
);

-- GOAL: Update col2 based on id

-- Upsert example
-- Note: ids 5 and 7 already exist in the table
INSERT INTO example
  (id, col1, col2)
  VALUES
  (5, 'dummy value', 'actual value to update 1'),
  (7, 'dummy value', 'actual value to update 2')
ON CONFLICT (id) DO UPDATE
  SET col2 = EXCLUDED.col2;



-- case when example
UPDATE example
  SET col2=(
        CASE 
          WHEN id=5 THEN 'actual value to update 1' 
          WHEN id=7 THEN 'actual value to update 2' 
        END
      )
WHERE
  id IN (5, 7);

As per my understanding, the first approach can be a bit misleading, since we only want to update values and have no intention of inserting anything. But apart from that, are there any major benefits of one over the other?

P.S.: PostgreSQL specific information will be more than welcome.


Solution

  • If there is no intent to perform an insert, then don't use an INSERT statement. Getting the desired result is important, but it is also important for the code to clearly express the intended operation. The CASE expression approach is problematic because each ID has to be specified redundantly: once in a WHEN branch and again in the WHERE clause. A better approach is the following:

    WITH new_values(id, new_value) AS (
      VALUES (5, 'new_value1'),
             (7, 'new_value2')
    )
    UPDATE example
      SET col2 = new_values.new_value
      FROM new_values
      WHERE example.id = new_values.id;
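
    One behavioral difference is worth spelling out. The following is a sketch against the example table above, assuming an id 9 that is absent from the table:

    ```sql
    -- With UPDATE ... FROM, an id that matches no row is simply skipped:
    WITH new_values(id, new_value) AS (
      VALUES (5, 'updated'),
             (9, 'skipped')     -- matches nothing; has no effect
    )
    UPDATE example
      SET col2 = new_values.new_value
      FROM new_values
      WHERE example.id = new_values.id;

    -- With the upsert, the same missing id silently becomes a NEW row,
    -- carrying the dummy col1 value -- likely not what was intended:
    INSERT INTO example (id, col1, col2)
      VALUES (9, 'dummy value', 'surprise row')
    ON CONFLICT (id) DO UPDATE
      SET col2 = EXCLUDED.col2;
    ```

    So the UPDATE form not only reads as an update, it also cannot accidentally insert.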
    

    When updating only a few rows, there is no significant difference in execution time. When a large number of rows are being updated, then the execution time can vary much more. A number of variables affect the relative performance, including: indexing, table size, row size, and the number of rows to be updated. Intuition might lead one to assume that direct updates would be more efficient than upserts; however, that is not necessarily the case.

    The following SQL creates a test bed to help evaluate the relative performance:

    CREATE TABLE upsert_target (
      id integer PRIMARY KEY,
      col1 text,
      padding text);
    
    CREATE TABLE upsert_source (
      id integer,
      new_value text);
    
    CREATE UNIQUE INDEX upsert_source_full_row ON upsert_source(id) INCLUDE (new_value);
    

    The test conditions are established with the following SQL:

    WITH
      parms(num_rows, num_updates, padding_size, format) AS (
        VALUES (1000000, 100000, 200, '0000000000')),
      trim_target AS (
        DELETE FROM upsert_target
          WHERE id > (SELECT num_rows FROM parms)),
      gen_target AS (
        INSERT INTO upsert_target(id, col1, padding)
          SELECT s.n, 'V_' || to_char(s.n, parms.format), rpad('', parms.padding_size)
            FROM parms
              CROSS JOIN LATERAL generate_series((SELECT COUNT(*) + 1 FROM upsert_target), parms.num_rows) s(n)),
      trim_source AS (
        DELETE FROM upsert_source
          WHERE (SELECT COUNT(*) <> parms.num_updates FROM parms CROSS JOIN upsert_source GROUP BY parms.num_updates))
    INSERT INTO upsert_source(id, new_value)
      SELECT n, 'V_' || to_char(n, parms.format)
        FROM parms
          CROSS JOIN LATERAL generate_series(1, parms.num_rows, parms.num_rows / parms.num_updates) s(n)
      ON CONFLICT (id) DO UPDATE SET new_value = EXCLUDED.new_value;
    

    The CTE trim_target helps reduce the time needed to generate a test setup when only num_rows or num_updates is changed. If either padding_size or format is changed, then upsert_target should be truncated prior to running this query.
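
    For reference, that reset step is simply:

    ```sql
    -- Reset before regenerating with a new padding_size or format:
    TRUNCATE upsert_target;
    ```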

    VACUUM FULL is run to minimize the impact of dead tuples on the query execution times.

    VACUUM FULL upsert_target;
    
    VACUUM FULL upsert_source;
    

    Running this query

    EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
    INSERT INTO upsert_target(id, col1)
      SELECT s.id, s.new_value
        FROM upsert_source s
      ON CONFLICT (id) DO UPDATE SET col1 = excluded.col1;
    

    produced the following results:

    Insert on public.upsert_target  (cost=0.00..1637.00 rows=0 width=0) (actual time=3025.643..3025.644 rows=0 loops=1)
      Conflict Resolution: UPDATE
      Conflict Arbiter Indexes: upsert_target_pkey
      Tuples Inserted: 0
      Conflicting Tuples: 100000
      Buffers: shared hit=1140923 dirtied=3125 written=3125
      ->  Seq Scan on public.upsert_source s  (cost=0.00..1637.00 rows=100000 width=50) (actual time=0.011..36.418 rows=100000 loops=1)
            Output: s.id, s.new_value, NULL::text
            Buffers: shared hit=637
    Planning Time: 0.066 ms
    Execution Time: 3025.741 ms
    

    Using the same data with the following query

    EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
    UPDATE upsert_target t
      SET col1 = s.new_value
      FROM upsert_source s
      WHERE t.id = s.id;
    

    gave these results:

    Update on public.upsert_target t  (cost=2887.00..56176.44 rows=0 width=0) (actual time=3550.676..3550.679 rows=0 loops=1)
      Buffers: shared hit=756643 read=28029 dirtied=217
      ->  Hash Join  (cost=2887.00..56176.44 rows=100000 width=26) (actual time=864.998..997.826 rows=100000 loops=1)
            Output: s.new_value, t.ctid, s.ctid
            Inner Unique: true
            Hash Cond: (t.id = s.id)
            Buffers: shared hit=13233 read=28029
            ->  Seq Scan on public.upsert_target t  (cost=0.00..50656.23 rows=1003123 width=10) (actual time=0.042..451.831 rows=1000000 loops=1)
                  Output: t.ctid, t.id
                  Buffers: shared hit=12596 read=28029
            ->  Hash  (cost=1637.00..1637.00 rows=100000 width=24) (actual time=67.392..67.394 rows=100000 loops=1)
                  Output: s.new_value, s.ctid, s.id
                  Buckets: 131072  Batches: 1  Memory Usage: 6493kB
                  Buffers: shared hit=637
                  ->  Seq Scan on public.upsert_source s  (cost=0.00..1637.00 rows=100000 width=24) (actual time=0.013..24.261 rows=100000 loops=1)
                        Output: s.new_value, s.ctid, s.id
                        Buffers: shared hit=637
    Planning:
      Buffers: shared hit=18
    Planning Time: 0.266 ms
    Execution Time: 3552.031 ms
    

    Changing the number of rows to be updated to 1,000 resulted in execution times of ~100ms for both approaches. For 10,000 updates, the execution time using INSERT...ON CONFLICT averaged ~330ms, while UPDATE averaged ~190ms. There was considerable variation in execution time between runs regardless of which approach was used. Because of this, it's important to average several runs to ensure that a single outlier doesn't lead to a false conclusion.
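
    One simple way to collect several timings is psql's built-in \timing meta-command, which prints the elapsed wall-clock time after each statement (a sketch, using the test-bed tables above):

    ```sql
    -- In psql: enable per-statement timing output.
    \timing on

    -- Re-run the statement under test several times and average the
    -- reported "Time:" values by hand (or script the runs).
    UPDATE upsert_target t
      SET col1 = s.new_value
      FROM upsert_source s
      WHERE t.id = s.id;
    ```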

    From this limited exploration, it appears that INSERT...ON CONFLICT performs slightly better for larger update sets than does a direct UPDATE; however, the difference isn't great enough to justify obscuring the code's intended operation. Efficiency is an important code quality, but so is readability.

    These results were produced running PostgreSQL v15.3 on a 15-inch, 2017 MacBook Pro with a 2.8 GHz Quad-Core Intel Core i7, 16GB RAM, 500GB SSD, and macOS Ventura 13.4.1 (c).