Search code examples
sqlperformancedistinctnotin

Efficient way to select all values from one column not in another column


I need to return all values from colA that are not in colB from mytable. I am using:

SELECT DISTINCT(colA) FROM mytable WHERE colA NOT IN (SELECT colB FROM mytable)

It is working however the query is taking an excessively long time to complete.

Is there a more efficient way to do this?


Solution

  • In standard SQL there are no parentheses in DISTINCT colA. DISTINCT is not a function.

    SELECT DISTINCT colA
    FROM   mytable
    WHERE  colA NOT IN (SELECT DISTINCT colB FROM mytable);
    

    Added DISTINCT to the sub-select as well. For many duplicates it can speed up the query. But, typically, there are more efficient query techniques than NOT IN for this:

    The deciding factor for speed will be indexes. You need indexes on colA and colB for this query to be fast.

    A CTE might be faster, depending on your DBMS. I additionally demonstrate LEFT JOIN as alternative to exclude the values in valB, and an alternative way to get distinct values with GROUP BY:

    WITH x AS (SELECT colB FROM mytable GROUP BY colB)
    SELECT m.colA
    FROM   mytable m
    LEFT   JOIN x ON x.colB = m.colA
    WHERE  x.colB IS NULL
    GROUP  BY m.colA;
    

    Or, simplified further, and with a plain subquery (probably fastest):

    SELECT DISTINCT m.colA
    FROM   mytable m
    LEFT   JOIN mytable x ON x.colB = m.colA
    WHERE  x.colB IS NULL;