Search code examples
sqlpostgresqlexists

Why does INSERT as SELECT with NOT EXISTS cause a unique violation?


Postgres version is 14.10. The table has nothing else about it (i.e. no row level security, etc.):

create schema perf;
create table perf.dba_hist_sqltext(
  queryid bigint primary key
, query text);

What I don't understand is how this does not work:
(even though it does on db<>fiddle)

WITH uniqstatements AS (
    SELECT ss.queryid,
           min(ss.query) query
    FROM pg_stat_statements AS ss
    GROUP by ss.queryid )
INSERT INTO perf.dba_hist_sqltext 
SELECT ss.queryid,
       ss.query
FROM uniqstatements AS ss
WHERE NOT EXISTS (
      SELECT 'X'
      FROM perf.dba_hist_sqltext AS st
      WHERE st.queryid = ss.queryid);

enter image description here

The WITH clause returns a single row for each queryid. The main query should run the WITH clause, then check each row against the target table being inserted into, and if it doesn't already exist, insert it. However, as you can see, something isn't working as expected.

I started this query with an in-line view instead of the WITH clause, and it has the same error, so I tried with the WITH clause instead.

I have been working with SQL for over 30 years and I am baffled by this. The NOT EXISTS should not allow a duplicate to get in. Why perform that action if it doesn't work.

What am I missing here?


Solution

    1. REINDEX INDEX dhst_queryid_pk; in case it got corrupted.
    2. If the unique constraint is currently deferred, concurrent worker might be beating you to the insert and you only find out about it at the end of this statement or the transaction it's in.
    3. Obfuscation by RLS is pretty exotic but entirely possible.
    4. Misuse of the rule system can lead to this sort of paranoia, similar to long and convoluted chains of triggers (constraint triggers can also be deferred).

    Index corruption was first because all else is usually visible upon closer inspection. Also, if you're on 14.10, this could've taken place back in 14.0 before it got fixed in 14.1:

    Ensure that parallel VACUUM doesn't miss any indexes (Peter Geoghegan, Masahiko Sawada)

    A parallel VACUUM would fail to process indexes that are below the min_parallel_index_scan_size cutoff, if the table also has at least two indexes that are above that size. This could result in those indexes becoming corrupt, since they'd still contain references to any heap entries removed by the VACUUM; subsequent queries using such indexes would be likely to return rows they shouldn't. This problem does not affect autovacuum, since it doesn't use parallel vacuuming. However, it is advisable to reindex any manually-vacuumed tables that have the right mix of index sizes.

    Or in 14.1, fixed in 14.2:

    Enforce standard locking protocol for TOAST table updates, to prevent problems with REINDEX CONCURRENTLY (Michael Paquier)

    If applied to a TOAST table or TOAST table's index, REINDEX CONCURRENTLY tended to produce a corrupted index. This happened because sessions updating TOAST entries released their ROW EXCLUSIVE locks immediately, rather than holding them until transaction commit as all other updates do. The fix is to make TOAST updates hold the table lock according to the normal rule. Any existing corrupted indexes can be repaired by reindexing again.

    Or 14.3, fixed in 14.4:

    Prevent possible corruption of indexes created or rebuilt with the CONCURRENTLY option (Álvaro Herrera)

    An optimization added in v14 caused CREATE INDEX ... CONCURRENTLY and REINDEX ... CONCURRENTLY to sometimes miss indexing rows that were updated during the index build. Revert that optimization. It is recommended that any indexes made with the CONCURRENTLY option be rebuilt after installing this update. (Alternatively, rebuild them without CONCURRENTLY.)

    Luckily, nothing in 14.11, 14.12 and 14.13 release notes suggests your current 14.10 could still corrupt a btree.

    The example isn't reproducible, suggesting it's something about the environment rather than the table setup or the statement:
    demo at db<>fiddle

    create schema perf;
    create table perf.dba_hist_sqltext(
      queryid bigint primary key
    , query text);
    
    create table pg_stat_statements(queryid,query)as values
    (1,'select*from everything join all_else on true;');
    
    WITH uniqstatements AS (
        SELECT ss.queryid,
               min(ss.query) query
        FROM pg_stat_statements AS ss
        GROUP by ss.queryid )
    INSERT INTO perf.dba_hist_sqltext 
    SELECT ss.queryid,
           ss.query
    FROM uniqstatements AS ss
    WHERE NOT EXISTS (
          SELECT 'X'
          FROM perf.dba_hist_sqltext AS st
          WHERE st.queryid = ss.queryid)
    RETURNING *;
    
    queryid query
    1 select*from everything join all_else on true;
    INSERT 0 1
    
    WITH uniqstatements AS (
        SELECT ss.queryid,
               min(ss.query) query
        FROM pg_stat_statements AS ss
        GROUP by ss.queryid )
    INSERT INTO perf.dba_hist_sqltext 
    SELECT ss.queryid,
           ss.query
    FROM uniqstatements AS ss
    WHERE NOT EXISTS (
          SELECT 'X'
          FROM perf.dba_hist_sqltext AS st
          WHERE st.queryid = ss.queryid)
    RETURNING *;
    
    queryid query
    INSERT 0 0