Tags: sql, postgresql, aggregate-functions, common-table-expression

Simplifying WHERE (NOT) IN (...) AND WHERE (NOT) IN (...)


In this dbfiddle demo I have a DELETE FROM... WHERE at the end like so:

    ......
    DELETE FROM data_table 
      WHERE
        (location, param_id, ref_time, fcst_time) NOT IN (SELECT location, param_id, ref_time, fcst_time FROM sel1)
      AND
        (location, param_id, ref_time, fcst_time) NOT IN (SELECT location, param_id, ref_time, fcst_time FROM sel2);

Although it works, it seems unnecessarily wordy, and possibly also suboptimal in terms of performance.

Is there any way this can be simplified, e.g. by using a single NOT IN clause?

Because of the way that they are defined, there is no overlap/intersection between sel1 and sel2.

Row counts:

  • status_table: 8033 rows
  • data_table: 116432724 rows
  • sel1: 61860084 rows
  • sel2: 53706188 rows

Numbers don't necessarily add up because it's a live database and data is going in all the time. And yes, at the moment there are few if any rows to delete because most/all of the data is current.


Solution

  • Answering the original question in this thread:

    ......
    DELETE FROM data_table 
      WHERE
        (location, param_id, ref_time, fcst_time) 
            NOT IN 
        (SELECT location, param_id, ref_time, fcst_time FROM sel1
            UNION ALL
         SELECT location, param_id, ref_time, fcst_time FROM sel2);
    

    Changing WHERE (a) NOT IN (b) AND NOT IN (c) to WHERE (a) NOT IN (b UNION c) not only shortens the expression, but also provides some performance improvements. Depending on how big an overlap there is between b and c, you can pick either UNION or UNION ALL:

    • WHERE (a) NOT IN (b UNION c) removes duplicates between b and c which adds a small cost initially but can speed things up later when a is compared against the effectively smaller union.
    • WHERE (a) NOT IN (b UNION ALL c) skips deduplication, so if by design there can't be any, it's the better choice.

    Here's a demo showing how the plans change. Locally, I also tested a delete from a 120M-row table against two 10M-row tables with 99% overlap and got 118 s for the initial version, 105 s with UNION, and 98 s with UNION ALL. With the same row counts but no overlap, it was 118 s, 103 s, and 95 s. With more columns and more complex types, I would expect the difference to be more pronounced.

    I figured I'd make my comment into an answer, adding the tests that followed and leaving the broader optimisation to be discussed in the new thread.