Search code examples
sql-serveroptimizationquery-optimizationunion

Enhance MS SQL Server performance on UNION operations


I have a MS SQL Server table T1 with three code columns: CodeA, CodeB1 and CodeB2. And I have a table type variable with the same columns.

I must join the table variable with the T1 table in order to get the lines in table variable that match for CodeB1 and/or CodeB2, but not CodeA, or match CodeA, but neither CodeB1 nor CodeB2.

I initially I did a single SELECT statement like that:

SELECT *
(SELECT 
 CASE WHEN t.CodeA = v.CodeA then 1 else 0 end as [EqualCodeA],
 CASE WHEN t.CodeB1 = v.CodeB1 then 1 else 0 end as [EqualCodeB1],
 CASE WHEN t.CodeB2 = v.CodeB2 then 1 else 0 end as [EqualCodeB2]
FROM @tableVariable v
INNER JOIN [T1] t
ON t.CodeA = v.CodeA or
   t.CodeB1 = v.CodeB1 or
   t.CodeB2 = v.CodeB2
)
WHERE NOT(EqualCodeA = 1 AND (EqualCodeB1 = 1 OR EqualCodeB2 = 1)

But that query has very poor performance. So I switched the OR in predicate for UNION, like that:

SELECT *
(SELECT 
 CASE WHEN t.CodeA = v.CodeA then 1 else 0 end as [EqualCodeA],
 CASE WHEN t.CodeB1 = v.CodeB1 then 1 else 0 end as [EqualCodeB1],
 CASE WHEN t.CodeB2 = v.CodeB2 then 1 else 0 end as [EqualCodeB2]
 FROM @tableVariable v
 INNER JOIN [T1] t
 ON t.CodeA = v.CodeA

 UNION
 SELECT 
 CASE WHEN t.CodeA = v.CodeA then 1 else 0 end as [EqualCodeA],
 CASE WHEN t.CodeB1 = v.CodeB1 then 1 else 0 end as [EqualCodeB1],
 CASE WHEN t.CodeB2 = v.CodeB2 then 1 else 0 end as [EqualCodeB2]
 FROM @tableVariable v
 INNER JOIN [T1] t
 ON t.CodeB1 = v.CodeB1

 UNION
 SELECT 
 CASE WHEN t.CodeA = v.CodeA then 1 else 0 end as [EqualCodeA],
 CASE WHEN t.CodeB1 = v.CodeB1 then 1 else 0 end as [EqualCodeB1],
 CASE WHEN t.CodeB2 = v.CodeB2 then 1 else 0 end as [EqualCodeB2]
 FROM @tableVariable v
 INNER JOIN [T1] t
 ON t.CodeB2 = v.CodeB2)
)
WHERE NOT(EqualCodeA = 1 AND (EqualCodeB1 = 1 OR EqualCodeB2 = 1)

The performance is now about ten times better, but it's still unacceptable. E.g. for 10K rows in table variable and 50K rows in T1, the query takes two minutes.

Looking at the actual execution plan, I see two Hash Match (Union) operations, each accounting for 46% of the cost.

How could I improve this performance?

Note 1: there is a Non-clustered index including all three columns, as well as three separate non-clustered indexes, one for each column.

Note 2: I use the OPTION(RECOMPILE); in order for the optimizer at least to know the table variable actual number of rows.


Solution

  • To be honest I don't really see how the example code matches the description of your requirement, but assuming the latter is correct, this is what I came up with:

    -- in order to get the lines in table variable that match for CodeB1 and/or CodeB2, but not CodeA, 
    -- or match CodeA, but neither CodeB1 nor CodeB2.
    SELECT v.*
      FROM @tableVariable v
      JOIN T1 t
        ON (v.CodeA <> t.CodeA AND (v.CodeB1 = t.CodeB1 OR v.CodeB2 = t.CodeB2))
        OR (v.codeA = t.CodeA  AND v.CodeB1 <> t.CodeB1 AND v.codeB2 <> t.codeB2)
    
    
    GO
    
    -- convert OR into UNION
    SELECT v.*
      FROM @tableVariable v
      JOIN T1 t
        ON (v.CodeA <> t.CodeA AND (v.CodeB1 = t.CodeB1 OR v.CodeB2 = t.CodeB2))
    
    UNION
    
    SELECT *
      FROM @tableVariable v
      JOIN T1 t
        ON (v.codeA = t.CodeA  AND v.CodeB1 <> t.CodeB1 AND v.codeB2 <> t.codeB2)
    
    
    GO
    
    -- further convert OR into UNION
    SELECT v.*
      FROM @tableVariable v
      JOIN T1 t
        ON v.CodeA <> t.CodeA 
       AND v.CodeB2 = t.CodeB2
    
    UNION
    
    SELECT v.*
      FROM @tableVariable v
      JOIN T1 t
        ON v.CodeA <> t.CodeA 
       AND v.CodeB1 = t.CodeB1
    
    UNION
    
    SELECT v.*
      FROM @tableVariable v
      JOIN T1 t
        ON v.codeA = t.CodeA  
       AND v.CodeB1 <> t.CodeB1 
       AND v.codeB2 <> t.codeB2
    
    -- potentially helpfull indexes
    CREATE INDEX idx1 ON T1 (CodeA) INCLUDE (CodeB1, CodeB2)
    CREATE INDEX idx2 ON T1 (CodeB1) INCLUDE (CodeA)
    CREATE INDEX idx3 ON T1 (CodeB2) INCLUDE (CodeA)
    

    It might well be this has the exact same cost as your solution, the optimizer might be able to (internally) convert them into the exact same action. It'd be interesting to see the query-plans and/or get a better idea about the data at hand.

    PS: and as already mentioned elsewhere, try to avoid @tableVariables, #tempTables are much better when handling more than a few records. (you can put indexes, statistics etc... as needed)