
SQL Server - Query optimization: LIKE queries causing 100% CPU usage


I have two tables in my database: Products and Filters.

The schema:

(screenshot: Products and Filters table schema)

I have created a query that fetches all records from the Filters table, loops over each record, and calls a procedure that sets the category id on the Products table.

The Filters table data looks as follows:

(screenshot: sample rows from the Filters table)

The filter selection query is as follows:

DECLARE @TotalRecords INT, @Start INT, @Limit INT, @CatId INT, @Merchants NVARCHAR(max), @NotMatch NVARCHAR(max), @WillMatch NVARCHAR(max);
SELECT @TotalRecords = COUNT(*) FROM filters;

SET @Limit = 1;
SET @Start = 0;

WHILE(@TotalRecords > 0)
BEGIN       
    SELECT @CatId = category_id, @Merchants = merchant_name, @NotMatch = not_match, @WillMatch = will_match FROM 
    (
        SELECT TOP (@Start + @Limit) *, ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS rnum 
        FROM filters
    ) a
    WHERE rnum > @Start;

    -- call filter procedure.
    exec procSetProductCategory @CatId = @CatId, @Merchants = @Merchants, @WillMatch = @WillMatch, @NotMatch = @NotMatch;

    SET @Start += 1;
    SET @TotalRecords -= 1;
END

And procSetProductCategory is as follows:

CREATE PROC [dbo].[procSetProductCategory]
(
    @CatId INT = NULL,
    @Merchants NVARCHAR(max),
    @NotMatch NVARCHAR(max),
    @WillMatch NVARCHAR(max)
)
AS
BEGIN
SET NOCOUNT ON

    declare @query nvarchar(max), @orToken nvarchar(max), @andToken nvarchar(max);
     set @query = 'UPDATE Products SET category_id = '+ convert(nvarchar(20), @CatId) + ' WHERE category_id IS NULL AND merchant_name IN(' + @Merchants + ')';

    if(@WillMatch is not null AND LTRIM(RTRIM(@WillMatch)) != '')
    BEGIN

        set @andToken = '%'' AND product_name LIKE ''%';
        set @WillMatch = REPLACE(@WillMatch, '+', @andToken);

        set @orToken = '%'') OR (product_name LIKE ''%';
        set @query = @query + ' AND ((product_name LIKE '''+ '%' + REPLACE(@WillMatch, ',', @orToken) + '%''))';
    END

    if(@NotMatch is not null AND LTRIM(RTRIM(@NotMatch)) != '')
    BEGIN
        set @andToken = '%'' AND product_name NOT LIKE ''%';
        set @NotMatch = REPLACE(@NotMatch, '+', @andToken);

        set @orToken = '%'') OR (product_name NOT LIKE ''%';
        set @query = @query + ' AND ((product_name NOT LIKE '''+ '%' + REPLACE(@NotMatch, ',', @orToken) + '%''))';
    END

    EXECUTE sp_executesql @query;
END

It generates SQL queries like the following:

Query #1
-------------------------------------------------------------------------------------------------------
UPDATE Products SET category_id = 101 WHERE merchant_name IN('merchant 1','merchant 4','merchant 3') AND 
 (
    (product_name LIKE '%abcd%' AND product_name LIKE '%efhg%')
 ) AND (
    (product_name NOT LIKE '%3258%')
     OR (product_name NOT LIKE '%yxzs%')
)


Query #2
-------------------------------------------------------------------------------------------------------
UPDATE Products SET category_id = 102 WHERE merchant_name IN('merchant 3', 'merchant 4') AND 
(
    (product_name LIKE '%1258%') OR (product_name LIKE '%abcd%')
)

Note that a couple of tricks are used here.

[,] separates match phrases. [+] joins two match phrases that must both match (an AND condition).

These queries do exactly what I need:

(screenshot: resulting Products rows with category_id set)

The issue is that when I run this against 500,000 products, CPU usage sits at about 100%.

How can I optimize this so that the results stay the same but the CPU usage goes down?


Solution

  • For starters, as pointed out already: there really is something wrong with the logic here. That said, assuming you are stuck with it, there are some things you might want to try. My first question would be: how long does this thing run? You shouldn't worry too much about it taking 100% CPU; the real question is how much time it takes to finish.

    Query1:

    It seems that you are creating a loop over the filters table, fetching every single row, one by one.

    • SQL isn't optimised for row-by-row operations; you really should consider changing the logic to something set-based.
    • If you really want to do something row by row, then please use a CURSOR and not the current approach.
      • First you go over the entire table to count how many filters there are
      • Then you go over the entire table and order the records by SELECT 1
      • Out of the sorted list you pick one that has rnum bigger than your counter

    => This is wrong in so many ways, it actually hurts =(

    • If you sort/order by SELECT 1 then it could return the records in the order ABCD the first time and BADC the second time; and both answers would be correct because you're sorting by a constant: the actual order of the records doesn't matter!
    • Each and every time you go through the loop, the server has to sort the entire table before it can tell which rnum values will fit the requirement of being greater than @start; EVERY TIME!
    • There will be many records that satisfy rnum > @start; the one actually returned into your variables could be any of them!
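
    The nondeterminism is easy to demonstrate: because the ORDER BY expression is a constant, ROW_NUMBER() is free to number the rows differently on every execution (an id column is assumed here for illustration, as in the cursor below):

    ```sql
    -- Sorting by a constant imposes no real order; both of the outcomes
    -- sketched below are 'correct' answers to the same query.
    SELECT id, ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS rnum
      FROM filters;
    -- run 1 might give: id=3 -> rnum=1, id=1 -> rnum=2, ...
    -- run 2 might give: id=1 -> rnum=1, id=3 -> rnum=2, ...
    ```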

    To 'fix' this I'd suggest to use the following approach:

    DECLARE @TotalRecords INT, 
            @Start INT, 
            @Limit INT, 
            @CatId INT, 
            @Merchants NVARCHAR(max), 
            @NotMatch NVARCHAR(max), 
            @WillMatch NVARCHAR(max);
    
    DECLARE filter_loop CURSOR LOCAL FAST_FORWARD
        FOR SELECT category_id, 
                   merchant_name,
                   not_match,
                   will_match
              FROM filters
             ORDER BY id -- not required but makes debugging easier
    OPEN filter_loop 
    FETCH NEXT FROM filter_loop INTO @CatId, @Merchants, @NotMatch, @WillMatch
    WHILE @@FETCH_STATUS = 0
        BEGIN
    
            -- call filter procedure.
            exec procSetProductCategory @CatId = @CatId, @Merchants = @Merchants, @WillMatch = @WillMatch, @NotMatch = @NotMatch;
    
            -- get next filter
            FETCH NEXT FROM filter_loop INTO @CatId, @Merchants, @NotMatch, @WillMatch
        END
    CLOSE filter_loop 
    DEALLOCATE filter_loop 
    

    Query2:

    At first sight there is very little I can do about the stored procedure itself. There is some dynamic SQL string building that might be optimized a little, but I very much doubt it would make much impact. As it stands it's fairly readable, so I'd leave it as is. The generated query then indeed looks something like this:

    UPDATE Products 
       SET category_id = 101 
     WHERE merchant_name IN ('merchant 1','merchant 4','merchant 3') 
       AND ((product_name LIKE '%abcd%' AND product_name LIKE '%efhg%') ) 
       AND ((product_name NOT LIKE '%3258%') OR (product_name NOT LIKE '%yxzs%'))
    

    for which I'd advise creating the following index:

    CREATE INDEX idx_test ON Products (merchant_name) INCLUDE (product_name);
    
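
    To check whether the index actually helps, you can compare the I/O statistics for one of the generated queries before and after creating it; a quick sketch, reusing the sample query from above:

    ```sql
    -- Compare the 'logical reads' figure in the Messages tab
    -- with and without the index in place.
    SET STATISTICS IO, TIME ON;

    UPDATE Products
       SET category_id = 101
     WHERE category_id IS NULL
       AND merchant_name IN ('merchant 1','merchant 4','merchant 3')
       AND product_name LIKE '%abcd%';

    SET STATISTICS IO, TIME OFF;
    ```

    Keep in mind that leading-wildcard patterns like '%abcd%' can never use an index seek on product_name; the index above only narrows the scan to the relevant merchants and avoids lookups into the base table.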

    Afterthoughts

    Even with the changes above in place, this will still run for quite a while when working on 100k+ records. The only real solution is a set-based approach, but that would require either a gargantuan dynamic SQL string or some better knowledge of the data itself. E.g. you might try combining Filters records that share the same Merchants value but have different Match/NoMatch values... Probably not too difficult, but I'd suggest starting with the suggestions above and then seeing where you end up.
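
    As a purely hypothetical example of what such a set-based pass could look like for the simplest case (filters whose will_match holds a single phrase and whose not_match is empty), assuming SQL Server 2016+ for STRING_SPLIT:

    ```sql
    -- Hypothetical sketch: apply all 'single phrase, no exclusions'
    -- filters in one set-based UPDATE instead of one call per row.
    -- The REPLACE strips the quotes that filters.merchant_name carries
    -- for the dynamic IN (...) list.
    UPDATE p
       SET p.category_id = f.category_id
      FROM Products p
      JOIN filters  f
        ON p.product_name LIKE '%' + f.will_match + '%'
     WHERE p.category_id IS NULL
       AND f.will_match NOT LIKE '%[,+]%'                  -- single phrase only
       AND (f.not_match IS NULL OR LTRIM(RTRIM(f.not_match)) = '')
       AND EXISTS (SELECT 1
                     FROM STRING_SPLIT(f.merchant_name, ',') AS s
                    WHERE LTRIM(REPLACE(s.value, '''', '')) = p.merchant_name);
    ```

    One caveat: if two filters can match the same product, which category_id "wins" becomes nondeterministic here, whereas the loop applies filters in a fixed order and the category_id IS NULL guard makes the first one stick. That alone may be a reason to stay with the cursor.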