Search code examples
postgresqlnullupsertpostgresql-10unique-index

How to merge two tables with possible NULL values in the UNIQUE index?


How to merge (upsert & delete orphan rows) to tableA?

tableA:

+---------+--------+----------+-------+
| company | option | category | rates |
+---------+--------+----------+-------+
| a       | f      | null     | 2.5   |
+---------+--------+----------+-------+
| a       | f      | d        | 2     | *
+---------+--------+----------+-------+
| a       | g      | e        | 3     | **
+---------+--------+----------+-------+
| c       | g      | e        | 4     |
+---------+--------+----------+-------+
| d       | f      | d        | 1     |
+---------+--------+----------+-------+

* denotes orphan row*.
** denotes value to change (3 -> 4).

Only touch companies existing in tableB (a & c in the example, leave d alone).

tableB:

+---------+--------+----------+-------+
| company | option | category | rates |
+---------+--------+----------+-------+
| a       | f      | null     | 2.5   |
+---------+--------+----------+-------+
| a       | g      | e        | 4     |
+---------+--------+----------+-------+
| c       | g      | e        | 4     |
+---------+--------+----------+-------+

There is a unique index on (company, option, category) in both tables.

Desired resulting tableA:

+---------+--------+----------+-------+
| company | option | category | rates |
+---------+--------+----------+-------+
| a       | f      | null     | 2.5   |
+---------+--------+----------+-------+
| a       | g      | e        | 4     | <-
+---------+--------+----------+-------+
| c       | g      | e        | 4     |
+---------+--------+----------+-------+
| d       | f      | d        | 1     |
+---------+--------+----------+-------+

Only the second row (a,f,d,2) was deleted and rates was changed from 3 to 4 for (a,g,e).

Here is a fiddle: https://rextester.com/QUVC30763

I'm thinking to first delete the orphan row with this:

DELETE from tableA
 USING tableB
 WHERE 
   -- ignore rows with IDs that don't exist in tableB
   tableA.company = tableB.company
   -- ignore rows that have an exact all-column match in tableB
   AND NOT EXISTS 
      (select * from tableB 
      where tableB.company is not distinct from tableA.company 
      AND tableB.option is not distinct from tableA.option 
      AND tableB.category is not distinct from tableA.category );

Then upsert with this:

 INSERT INTO tableA (company, option, category, rates) 
   SELECT company, option, category, rates
   FROM   tableB
 ON CONFLICT (company, option, category) 
 DO update
   set rates= EXCLUDED.rates
 WHERE 
      tableA.rates IS DISTINCT FROM 
      EXCLUDED.rates;

But the problem with the upsert function is that it can't handle nullable fields. I have to set -1 in place of null or else the function won't be able to know if there are duplicates or not. I feel like setting -1 in place of null will create many workarounds in the future, so I'd like to avoid that if I can.

Note: I found that INSERT ... ON CONFLICT ... DO UPDATE is probably the way to go:

But I haven't seen a query suitable for my case. And I'm not sure if it's possible with nullable fields. Hence the question:
Is there a clean way to merge with nullable fields?


Solution

  • I think you are on the right path. But there is a design problem with NULL vs. UNIQUE:

    The columns option and category can be NULL. And NULL is to be considered equal in those cases. Your current unique indexes do not consider NULL values to be equal, hence do not enforce your requirements. This creates ambiguities even before you start to merge. The NULL value is no good for what you are trying to implement. Working around this will create a lot more work and additional points of failure. Consider using a special value instead of NULL and everything falls into place. You were considering -1. Anything that naturally makes sense for your actual data type and the nature of the attribute.

    That said, the DELETE has an additional, subtly hidden problem: it would try to delete orphan rows as many times as there are matches on company in tableB. Nothing breaks since excess attempts do nothing, but it's needlessly expensive. Use EXISTS twice instead:

    DELETE FROM tableA a
    WHERE  EXISTS (
       SELECT FROM tableB b
       WHERE a.company = b.company
       )
    AND    NOT EXISTS (
       SELECT FROM tableB b
       WHERE (a.company, a.option, a.category) IS NOT DISTINCT FROM
             (b.company, b.option, b.category)
       );
    

    If you insist on working with NULL values, splitting the UPSERT into UPDATE, followed by INSERT ... ON CONFLICT DO NOTHING would be the workaround. Simpler and cheaper if you don't have concurrent writes to the table. ON CONFLICT DO NOTHING works without specifying a conflict target, so you could implement your requirements with several partial indexes and get this to work. The manual:

    For ON CONFLICT DO NOTHING, it is optional to specify a conflict_target; when omitted, conflicts with all usable constraints (and unique indexes) are handled. For ON CONFLICT DO UPDATE, a conflict_target must be provided.

    But if you fix your schema with a working UNIQUE index or constraint, the UPSERT you already have serves nicely.

    And make sure there are no concurrent writes to the table or you may face race conditions and / or deadlocks unless you do more ...