Tags: sql, sql-server, t-sql, sql-server-2016

Why does swapping out a = comparison for an unused IN slow down this T-SQL query so much?


Apologies, but I can't offer a reproducible example for this. It won't reproduce with a simple example, so it must be related to our data structures and volume; I'm hoping someone can see a pattern here and offer some advice.

We previously had a stored procedure which felt oddly written to me but worked fine. It ran code equivalent to the following pseudocode:

DECLARE @HasResults BIT = 0;

IF 
    (SELECT COUNT(*) FROM myTable t
    WHERE
        t.field1 = @param1
        OR t.field2 = @param2
        OR t.field3 = @param3
        OR t.field4 = @param4) > 0
    SET @HasResults = 1

SELECT @HasResults AS HasResults

All of the fields and params were originally integers, and in normal usage all but one of them will be NULL. I had to change one of the params to an nvarchar(max) so that it could accept a list, which I split with a fairly standard splitting function and then use in an IN clause:

DECLARE @HasResults BIT = 0;

IF 
    (SELECT COUNT(*) FROM myTable t
    WHERE
        t.field1 = @param1
        OR t.field2 = @param2
        OR t.field3 IN (SELECT item FROM dbo.fnSplit(@param3, ','))
        OR t.field4 = @param4) > 0
    SET @HasResults = 1

SELECT @HasResults AS HasResults 

This caused the query, in some circumstances, to go from sub-second to over a minute. You might expect that from an IN comparison, but what baffles me is that if there's data in @param3 it works fine: it's when @param3 is NULL that the query is slow. If you comment out the IN clause, it goes back to sub-second speed.

The splitting function itself isn't the problem here: it's very fast, and I've experimented with it, but nothing improves.

To confuse me further, I discovered that you can significantly improve the situation by removing that unnecessary IF statement. The following takes about 10 seconds to run, which is much slower than the original query but much faster than with the IF:

SELECT COUNT(*) FROM myTable t
    WHERE
        t.field1 = @param1
        OR t.field2 = @param2
        OR t.field3 = @param3
        OR t.field4 = @param4

Why does this query run so much slower when I try to split a NULL value and use the results in an IN clause, and why does the IF have such an impact?
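For reference, a sketch of a guarded rewrite I've been considering (reusing the pseudocode names from above; an illustration only, not tested against our data): making the NULL case explicit and swapping COUNT(*) for EXISTS lets the scan stop at the first match, and may let the plan skip the split entirely when @param3 is NULL.

DECLARE @HasResults BIT = 0;

-- EXISTS can stop at the first matching row instead of counting them all.
-- The explicit IS NOT NULL check makes the "no list supplied" case visible
-- to the optimizer rather than relying on the IN returning no rows.
IF EXISTS
    (SELECT 1 FROM myTable t
    WHERE
        t.field1 = @param1
        OR t.field2 = @param2
        OR (@param3 IS NOT NULL
            AND t.field3 IN (SELECT item FROM dbo.fnSplit(@param3, ',')))
        OR t.field4 = @param4)
    SET @HasResults = 1

SELECT @HasResults AS HasResults

T-SQL does not guarantee boolean short-circuiting, but in many plans the optimizer turns a check like @param3 IS NOT NULL into a startup filter, so the split branch is never executed for a NULL parameter.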

EDIT: split function as requested:

CREATE FUNCTION [dbo].[fnSplit] (
    @sInputList VARCHAR(MAX),
    @sDelimiter VARCHAR(MAX) = ','
)
RETURNS @List TABLE (item VARCHAR(MAX))
AS
BEGIN
    DECLARE @sItem VARCHAR(MAX)

    -- Peel items off the front of the list, one delimiter at a time
    WHILE CHARINDEX(@sDelimiter, @sInputList, 0) <> 0
    BEGIN
        SELECT
            @sItem      = RTRIM(LTRIM(SUBSTRING(@sInputList, 1, CHARINDEX(@sDelimiter, @sInputList, 0) - 1))),
            @sInputList = RTRIM(LTRIM(SUBSTRING(@sInputList, CHARINDEX(@sDelimiter, @sInputList, 0) + LEN(@sDelimiter), LEN(@sInputList))))

        IF LEN(@sItem) > 0
            INSERT INTO @List SELECT @sItem
    END

    -- Whatever remains after the last delimiter is the final item
    IF LEN(@sInputList) > 0
        INSERT INTO @List SELECT @sInputList

    RETURN
END
GO

Solution

  • Understanding why a scalar UDF or a multi-statement table-valued function can kill performance is very simple.

    A UDF is written in Transact-SQL, which is an interpreted language, not a compiled one, so the function is called for every row. This has three consequences:

    • the function executes once per row (the RBAR effect: "row by agonizing row")
    • parallel processing is forbidden, because the function could have side effects
    • indexes cannot be used, because indexes are useless once the data has been transformed

    So if you want performance, avoid UDFs whenever another solution exists.

    You can use STRING_SPLIT instead, which I think is the fastest option, or an XML-based split.

    In fact, queries operate on sets, not on individual values, so anything that introduces an iterative process will kill performance. Recursive queries included...
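As a sketch of that suggestion applied to the question's pseudocode (STRING_SPLIT is available from SQL Server 2016, which matches the question's tags; table and parameter names are taken from the question):

DECLARE @HasResults BIT = 0;

-- STRING_SPLIT is an inline, set-based operator, so it is expanded into
-- the query plan instead of being invoked row by row like a UDF.
-- Note: it returns its items in a column named "value".
IF
    (SELECT COUNT(*) FROM myTable t
    WHERE
        t.field1 = @param1
        OR t.field2 = @param2
        OR t.field3 IN (SELECT value FROM STRING_SPLIT(@param3, ','))
        OR t.field4 = @param4) > 0
    SET @HasResults = 1

SELECT @HasResults AS HasResults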