Search code examples
sqlsql-serversql-server-2019

Alternatives to OR NULL in SQL join


It is clear from this question that OR conditions in joins in SQL are likely to be inefficient because they require nested loops and can not be optimised as a hash or merge join. The recommend solution is to replace with UNION statements.

The query I am trying to optimise currently contains a series of OR conditions of the form: a.col = b.col OR a.col IS NULL. For a million rows, we are estimating runtimes measured in hours.

However, as there could be a dozen of these conditions, converting this into the equivalent UNION statements is likely to be cumbersome - with each OR condition doubling the number of subqueries that have to be UNIONed. So I am keen to find a better alternative.

An example of what I am currently working with:

SELECT  *
FROM #table_a AS a
LEFT JOIN #table_b AS b
ON  (a.col1 = b.col1 OR a.col1 IS NULL)
AND  (a.col2 = b.col2 OR a.col2 IS NULL)
AND  (a.col3 = b.col3 OR a.col3 IS NULL)

I have considered using COALESCE in the join, something like the following. This removes OR from the condition, but I expect the need to calculate the value of the COALESCE is also going to prevent optimising the join.

SELECT  *
FROM #table_a AS a
LEFT JOIN #table_b AS b
ON  COALESCE(a.col1, b.col1, -1) = COALESCE(b.col1, -1)
AND  COALESCE(a.col2, b.col2, -1) = COALESCE(b.col2, -1)
AND  COALESCE(a.col3, b.col3, -1) = COALESCE(b.col3, -1)

How can I approach optimising this join pattern?


A few clarifying details in response to comments:

  • While the current query allows for the case where all a cols are null. In most instances I would expect that every a col contains some nulls and every a row contains some nulls, but that no a col or row is completely null.
  • There are a couple of things I might use this pattern for. One is to join subtotals to their supertotal. For example table_b might contain the counts for each combination of age group, region, and country of birth. Then when table_a has nulls in the age column, that row would contain the total for a region and a country of birth - over all age groups.

Solution

  • For the following answer, I have renamed your tables as SearchCriteria and DataToSearch for improved readability.

    I am assuming that SearchCriteria contains one or a few rows, while DataToSearch contains many rows. It is also assumed that appropriate indexes are defined on the DataToSearch table to support reasonably efficient lookups for various combinations of criteria.

    You can generate dynamic SQL (one query per criteria row) that applies just the relevant filter conditions. Non-null values generate a filter condition that is added to the WHERE clause. Null search values are skipped. The results from all of the generated queries can then be combined using UNION ALL.

    DECLARE @NL CHAR = CHAR(10) -- newline
    DECLARE @UnionAll NVARCHAR(100) = @NL + 'UNION ALL' + @NL
    
    DECLARE @sql NVARCHAR(MAX) = (
        SELECT STRING_AGG(Q.Sql, @UnionAll) WITHIN GROUP(ORDER BY S.search_id)
        --SELECT *
        FROM SearchCriteria S
        CROSS APPLY (
            SELECT CONCAT(
                CAST('' AS NVARCHAR(MAX)),
                'SELECT ', S.search_id, ' AS search_id, D.data_id, D.col1, D.col2, D.col3',
                @NL, 'FROM DataToSearch D',
                @NL, 'WHERE 1 = 1',
                -- The following lines will either generate a "AND condition" line
                -- or null (no condition) for cases where the search value is null.
                -- Use the following for exact numeric values (excluding real/float)
                -- If used for text values, the following is open to SQL Injection
                (@NL + 'AND D.col1 = ' + CAST(S.col1 AS VARCHAR(30))),
                (@NL + 'AND D.col2 = ' + CAST(S.col2 AS VARCHAR(30))),
                (@NL + 'AND D.col3 = ' + CAST(S.col3 AS VARCHAR(30))),
                -- Use the following for text values (limit 128 characters)
                --(@NL + 'AND D.col1 = N' + QUOTENAME(S.col1, '''')),
                --(@NL + 'AND D.col2 = N' + QUOTENAME(S.col2, '''')),
                --(@NL + 'AND D.col3 = N' + QUOTENAME(S.col3, '''')),
                -- Use the following for text values (possibly longer than 128 characters)
                --(@NL + 'AND D.col1 = N''' + REPLACE(S.col1, '''', '''''') + ''''),
                --(@NL + 'AND D.col2 = N''' + REPLACE(S.col2, '''', '''''') + ''''),
                --(@NL + 'AND D.col3 = N''' + REPLACE(S.col3, '''', '''''') + ''''),
                -- Use an appropriate mix of the above, if the columns have mixed types
                ''
            ) AS Sql
        ) Q
    )
    
    PRINT @sql
    EXEC sp_executesql @sql
    

    When a criteria column is null, the generated AND column = value expression will also be null, and the CONCAT() function will quietly skip that snippet.

    Extreme care must be taken when injecting values into dynamic SQL to ensure that they are properly quoted and sanitized embedded quotes doubled up). Integer and other numeric values are safe as is, but text values must be carefully quoted and any embedded quotes doubled up. (Example "O'Malley" becomes N'O''Malley'.) If you have any floating point values (SQL REAL or FLOAT types), you would need to cast them using CONVERT(VARCHAR(30), value, 3) to ensure maximum precision is preserved.

    An safer alternative is to use a cursor loop to generate and execute multiple parameterized queries that each add their results to a temp #Results table. The final results are then selected from that temp table.

    CREATE TABLE #Results (
        search_id INT,
        data_id INT,
        col1 INT,
        col2 INT,
        col3 INT,
    )
    
    DECLARE CSR CURSOR FAST_FORWARD FOR
        SELECT *
        FROM SearchCriteria
        ORDER BY search_id
    OPEN CSR
    
    DECLARE @search_id INT,  @col1 INT, @col2 INT, @col3 INT
    
    FETCH NEXT FROM CSR INTO @search_id,  @col1, @col2, @col3
    WHILE @@FETCH_STATUS = 0
    BEGIN
        DECLARE @sql NVARCHAR(MAX) = CONCAT(
            CAST('' AS NVARCHAR(MAX))
            , 'INSERT INTO #Results'
            , CHAR(10) + 'SELECT @search_id, D.data_id, D.col1, D.col2, D.col3'
            , CHAR(10) + 'FROM DataToSearch D'
            , CHAR(10) + 'WHERE 1 = 1'
            , CASE WHEN @col1 IS NOT NULL THEN CHAR(10) + 'AND D.col1 = @col1' END
            , CASE WHEN @col2 IS NOT NULL THEN CHAR(10) + 'AND D.col2 = @col2' END
            , CASE WHEN @col3 IS NOT NULL THEN CHAR(10) + 'AND D.col3 = @col3' END
        )
    
        DECLARE @params NVARCHAR(MAX) = '@search_id INT,  @col1 INT, @col2 INT, @col3 INT'
    
      PRINT @sql
        EXEC sp_executesql @sql, @params, @search_id, @col1, @col2, @col3
      
        FETCH NEXT FROM CSR INTO @search_id, @col1, @col2, @col3
    END
    
    CLOSE CSR
    DEALLOCATE CSR
    
    SELECT *
    FROM #Results
    
    DROP TABLE #Results
    

    Both of the above should perform index seeks for nearly all cases, depending on index availability. Cursor and temp table overhead in the second case should be minimal.

    UNION ALL in the above code can be changed to a simple UNION if you wish to eliminate duplicate results. (The search_id select item should also then be removed from the SQL for this to be effective.)

    Given the following SearchCriteria data:

    search_id col1 col2 col3
    1 1 2 3
    2 4 5 null
    3 null 6 7

    The following Dynamic SQL is generated:

    SELECT 1 AS search_id, D.data_id, D.col1, D.col2, D.col3
    FROM DataToSearch D
    WHERE 1 = 1
    AND D.col1 = 1
    AND D.col2 = 2
    AND D.col3 = 3
    UNION ALL
    SELECT 2 AS search_id, D.data_id, D.col1, D.col2, D.col3
    FROM DataToSearch D
    WHERE 1 = 1
    AND D.col1 = 4
    AND D.col2 = 5
    UNION ALL
    SELECT 3 AS search_id, D.data_id, D.col1, D.col2, D.col3
    FROM DataToSearch D
    WHERE 1 = 1
    AND D.col2 = 6
    AND D.col3 = 7
    

    The second version above generates the following sequence of queries, executing each using appropriate parameter values:

    INSERT INTO #Results
    SELECT @search_id, D.data_id, D.col1, D.col2, D.col3
    FROM DataToSearch D
    WHERE 1 = 1
    AND D.col1 = @col1
    AND D.col2 = @col2
    AND D.col3 = @col3
    
    INSERT INTO #Results
    SELECT @search_id, D.data_id, D.col1, D.col2, D.col3
    FROM DataToSearch D
    WHERE 1 = 1
    AND D.col1 = @col1
    AND D.col2 = @col2
    
    INSERT INTO #Results
    SELECT @search_id, D.data_id, D.col1, D.col2, D.col3
    FROM DataToSearch D
    WHERE 1 = 1
    AND D.col2 = @col2
    AND D.col3 = @col3
    

    Using some sample data-to-search that includes a mix of null and non-null values, the following results are obtained:

    search_id data_id col1 col2 col3
    1 313 1 2 3
    2 46 4 5 null
    2 146 4 5 1
    2 246 4 5 2
    2 346 4 5 3
    ... ... ... ... ...
    3 707 null 6 7
    3 717 1 6 7
    3 727 2 6 7
    3 737 3 6 7
    ... ... ... ... ...

    See this db<>fiddle for a demo of both techniques, including the generated execution plans.

    The original posted queries also produce the same results, but at a much greater cost caused by multiple table scans.

    Bonus:

    If to have an option to code this up as a LINQ query, the following C# code might work (untested):

    var query = SearchCriteria
        .Select(sc => {
            var innerQuery = context.DataToSearch.AsQueryable();
            if (sc.col1 != null)
            {
                innerQuery = innerQuery .Where(d => d.col1 == sc.col1)
            }
            if (sc.col2 != null)
            {
                innerQuery = innerQuery .Where(d => d.col2 == sc.col2)
            }
            if (sc.col3 != null)
            {
                innerQuery = innerQuery .Where(d => d.col3 == sc.col3)
            }
            return innerQuery ;
        })
        .Aggregate((l, r) => l.Concat(r));  // Use .Union() to dedup
    

    If you wish to dedup the results, the .Concat() can be changed to .Union() above. (I think this will be executed on the SQL server side. If not, you might need to define a comparator so that actual values, not references, are compared.)