It is clear from this question that OR
conditions in joins in SQL are likely to be inefficient because they require nested loops and can not be optimised as a hash or merge join. The recommend solution is to replace with UNION
statements.
The query I am trying to optimise currently contains a series of OR
conditions of the form: a.col = b.col OR a.col IS NULL
. For a million rows, we are estimating runtimes measured in hours.
However, as there could be a dozen of these conditions, converting this into the equivalent UNION
statements is likely to be cumbersome - with each OR
condition doubling the number of subqueries that have to be UNION
ed. So I am keen to find a better alternative.
An example of what I am currently working with:
SELECT *
FROM #table_a AS a
LEFT JOIN #table_b AS b
ON (a.col1 = b.col1 OR a.col1 IS NULL)
AND (a.col2 = b.col2 OR a.col2 IS NULL)
AND (a.col3 = b.col3 OR a.col3 IS NULL)
I have considered using COALESCE
in the join, something like the following. This removes OR
from the condition, but I expect the need to calculate the value of the COALESCE
is also going to prevent optimising the join.
SELECT *
FROM #table_a AS a
LEFT JOIN #table_b AS b
ON COALESCE(a.col1, b.col1, -1) = COALESCE(b.col1, -1)
AND COALESCE(a.col2, b.col2, -1) = COALESCE(b.col2, -1)
AND COALESCE(a.col3, b.col3, -1) = COALESCE(b.col3, -1)
How can I approach optimising this join pattern?
A few clarifying details in response to comments:
a
cols are null. In most instances I would expect that every a
col contains some nulls and every a
row contains some nulls, but that no a
col or row is completely null.For the following answer, I have renamed your tables as SearchCriteria
and DataToSearch
for improved readability.
I am assuming that SearchCriteria
contains one or a few rows, while DataToSearch
contains many rows. It is also assumed that appropriate indexes are defined on the DataToSearch
table to support reasonably efficient lookups for various combinations of criteria.
You can generate dynamic SQL (one query per criteria row) that applies just the relevant filter conditions. Non-null values generate a filter condition that is added to the WHERE
clause. Null search values are skipped. The results from all of the generated queries can then be combined using UNION ALL
.
DECLARE @NL CHAR = CHAR(10) -- newline
DECLARE @UnionAll NVARCHAR(100) = @NL + 'UNION ALL' + @NL
DECLARE @sql NVARCHAR(MAX) = (
SELECT STRING_AGG(Q.Sql, @UnionAll) WITHIN GROUP(ORDER BY S.search_id)
--SELECT *
FROM SearchCriteria S
CROSS APPLY (
SELECT CONCAT(
CAST('' AS NVARCHAR(MAX)),
'SELECT ', S.search_id, ' AS search_id, D.data_id, D.col1, D.col2, D.col3',
@NL, 'FROM DataToSearch D',
@NL, 'WHERE 1 = 1',
-- The following lines will either generate a "AND condition" line
-- or null (no condition) for cases where the search value is null.
-- Use the following for exact numeric values (excluding real/float)
-- If used for text values, the following is open to SQL Injection
(@NL + 'AND D.col1 = ' + CAST(S.col1 AS VARCHAR(30))),
(@NL + 'AND D.col2 = ' + CAST(S.col2 AS VARCHAR(30))),
(@NL + 'AND D.col3 = ' + CAST(S.col3 AS VARCHAR(30))),
-- Use the following for text values (limit 128 characters)
--(@NL + 'AND D.col1 = N' + QUOTENAME(S.col1, '''')),
--(@NL + 'AND D.col2 = N' + QUOTENAME(S.col2, '''')),
--(@NL + 'AND D.col3 = N' + QUOTENAME(S.col3, '''')),
-- Use the following for text values (possibly longer than 128 characters)
--(@NL + 'AND D.col1 = N''' + REPLACE(S.col1, '''', '''''') + ''''),
--(@NL + 'AND D.col2 = N''' + REPLACE(S.col2, '''', '''''') + ''''),
--(@NL + 'AND D.col3 = N''' + REPLACE(S.col3, '''', '''''') + ''''),
-- Use an appropriate mix of the above, if the columns have mixed types
''
) AS Sql
) Q
)
PRINT @sql
EXEC sp_executesql @sql
When a criteria column is null, the generated AND column = value
expression will also be null, and the CONCAT()
function will quietly skip that snippet.
Extreme care must be taken when injecting values into dynamic SQL to ensure that they are properly quoted and sanitized embedded quotes doubled up). Integer and other numeric values are safe as is, but text values must be carefully quoted and any embedded quotes doubled up. (Example "O'Malley" becomes N'O''Malley'
.) If you have any floating point values (SQL REAL
or FLOAT
types), you would need to cast them using CONVERT(VARCHAR(30), value, 3)
to ensure maximum precision is preserved.
An safer alternative is to use a cursor loop to generate and execute multiple parameterized queries that each add their results to a temp #Results
table. The final results are then selected from that temp table.
CREATE TABLE #Results (
search_id INT,
data_id INT,
col1 INT,
col2 INT,
col3 INT,
)
DECLARE CSR CURSOR FAST_FORWARD FOR
SELECT *
FROM SearchCriteria
ORDER BY search_id
OPEN CSR
DECLARE @search_id INT, @col1 INT, @col2 INT, @col3 INT
FETCH NEXT FROM CSR INTO @search_id, @col1, @col2, @col3
WHILE @@FETCH_STATUS = 0
BEGIN
DECLARE @sql NVARCHAR(MAX) = CONCAT(
CAST('' AS NVARCHAR(MAX))
, 'INSERT INTO #Results'
, CHAR(10) + 'SELECT @search_id, D.data_id, D.col1, D.col2, D.col3'
, CHAR(10) + 'FROM DataToSearch D'
, CHAR(10) + 'WHERE 1 = 1'
, CASE WHEN @col1 IS NOT NULL THEN CHAR(10) + 'AND D.col1 = @col1' END
, CASE WHEN @col2 IS NOT NULL THEN CHAR(10) + 'AND D.col2 = @col2' END
, CASE WHEN @col3 IS NOT NULL THEN CHAR(10) + 'AND D.col3 = @col3' END
)
DECLARE @params NVARCHAR(MAX) = '@search_id INT, @col1 INT, @col2 INT, @col3 INT'
PRINT @sql
EXEC sp_executesql @sql, @params, @search_id, @col1, @col2, @col3
FETCH NEXT FROM CSR INTO @search_id, @col1, @col2, @col3
END
CLOSE CSR
DEALLOCATE CSR
SELECT *
FROM #Results
DROP TABLE #Results
Both of the above should perform index seeks for nearly all cases, depending on index availability. Cursor and temp table overhead in the second case should be minimal.
UNION ALL
in the above code can be changed to a simple UNION
if you wish to eliminate duplicate results. (The search_id
select item should also then be removed from the SQL for this to be effective.)
Given the following SearchCriteria data:
search_id | col1 | col2 | col3 |
---|---|---|---|
1 | 1 | 2 | 3 |
2 | 4 | 5 | null |
3 | null | 6 | 7 |
The following Dynamic SQL is generated:
SELECT 1 AS search_id, D.data_id, D.col1, D.col2, D.col3
FROM DataToSearch D
WHERE 1 = 1
AND D.col1 = 1
AND D.col2 = 2
AND D.col3 = 3
UNION ALL
SELECT 2 AS search_id, D.data_id, D.col1, D.col2, D.col3
FROM DataToSearch D
WHERE 1 = 1
AND D.col1 = 4
AND D.col2 = 5
UNION ALL
SELECT 3 AS search_id, D.data_id, D.col1, D.col2, D.col3
FROM DataToSearch D
WHERE 1 = 1
AND D.col2 = 6
AND D.col3 = 7
The second version above generates the following sequence of queries, executing each using appropriate parameter values:
INSERT INTO #Results
SELECT @search_id, D.data_id, D.col1, D.col2, D.col3
FROM DataToSearch D
WHERE 1 = 1
AND D.col1 = @col1
AND D.col2 = @col2
AND D.col3 = @col3
INSERT INTO #Results
SELECT @search_id, D.data_id, D.col1, D.col2, D.col3
FROM DataToSearch D
WHERE 1 = 1
AND D.col1 = @col1
AND D.col2 = @col2
INSERT INTO #Results
SELECT @search_id, D.data_id, D.col1, D.col2, D.col3
FROM DataToSearch D
WHERE 1 = 1
AND D.col2 = @col2
AND D.col3 = @col3
Using some sample data-to-search that includes a mix of null and non-null values, the following results are obtained:
search_id | data_id | col1 | col2 | col3 |
---|---|---|---|---|
1 | 313 | 1 | 2 | 3 |
2 | 46 | 4 | 5 | null |
2 | 146 | 4 | 5 | 1 |
2 | 246 | 4 | 5 | 2 |
2 | 346 | 4 | 5 | 3 |
... | ... | ... | ... | ... |
3 | 707 | null | 6 | 7 |
3 | 717 | 1 | 6 | 7 |
3 | 727 | 2 | 6 | 7 |
3 | 737 | 3 | 6 | 7 |
... | ... | ... | ... | ... |
See this db<>fiddle for a demo of both techniques, including the generated execution plans.
The original posted queries also produce the same results, but at a much greater cost caused by multiple table scans.
Bonus:
If to have an option to code this up as a LINQ query, the following C# code might work (untested):
var query = SearchCriteria
.Select(sc => {
var innerQuery = context.DataToSearch.AsQueryable();
if (sc.col1 != null)
{
innerQuery = innerQuery .Where(d => d.col1 == sc.col1)
}
if (sc.col2 != null)
{
innerQuery = innerQuery .Where(d => d.col2 == sc.col2)
}
if (sc.col3 != null)
{
innerQuery = innerQuery .Where(d => d.col3 == sc.col3)
}
return innerQuery ;
})
.Aggregate((l, r) => l.Concat(r)); // Use .Union() to dedup
If you wish to dedup the results, the .Concat()
can be changed to .Union()
above. (I think this will be executed on the SQL server side. If not, you might need to define a comparator so that actual values, not references, are compared.)