Search code examples
sql-servert-sqlsql-server-2019

Recursively call the REPLACE function without recursion when the search_expression values are in subsequent rows?


Is it possible to recursively call the REPLACE function, without recursion, looping, dynamic SQL, UFNs?

Update: Re-wrote the entire question.

I'd like to be able to select a string with all the [search_expression] words removed in one quick [preferably set-based] action.

This works fine here, but if I have 500K rows it'll be a different story:

IF OBJECT_ID('TempDB..##TempStringsToClean') IS NOT NULL
    DROP TABLE ##TempStringsToClean;
GO

CREATE TABLE ##TempStringsToClean (
     [ID]                   INT
    ,[TheString]            VARCHAR(128)
    ,[ExprNum]              INT
    ,[SearchExpression]     VARCHAR(64)
);

INSERT INTO ##TempStringsToClean
SELECT *
FROM (VALUES
     (1, 'The Quick Brown Fox Jumps Over the Lazy Dog', 0, 'Brown'  )
    ,(1, 'The Quick Brown Fox Jumps Over the Lazy Dog', 1, 'Jump'   )
    ,(1, 'The Quick Brown Fox Jumps Over the Lazy Dog', 2, 'Over'   )
    ,(2, 'The Quick Brown Fox Jumps Over the Lazy Dog', 0, 'Quick'  )
    ,(2, 'The Quick Brown Fox Jumps Over the Lazy Dog', 1, 'F'      )
    ,(2, 'The Quick Brown Fox Jumps Over the Lazy Dog', 2, 'Lazy'   )
    ,(3, 'Order of operations is important for the Business Unit "BU".', 0, 'BU')
    ,(3, 'Order of operations is important for the Business Unit "BU".', 1, 'Business Unit')
    ,(4, 'Order of operations is important for the Business Unit "BU".', 0, 'Business Unit')
    ,(4, 'Order of operations is important for the Business Unit "BU".', 1, 'BU')
) AS VALUE([ID],[TheString],[ExprNum],[SearchExpression])
;

;WITH RecurCTE AS (
    SELECT
         TS.[ID]                
        ,TS.[TheString]     
        ,TS.[ExprNum]           
        ,TS.[SearchExpression]
        ,[NewString]            = REPLACE(TS.[TheString], TS.[SearchExpression], '')
        ,[Level]                = 1
    FROM ##TempStringsToClean TS
    WHERE [ExprNum] = 0
    
    UNION ALL

    SELECT
         TS.[ID]                
        ,TS.[TheString]     
        ,TS.[ExprNum]           
        ,TS.[SearchExpression]
        ,[NewString]            = REPLACE(R.[NewString], TS.[SearchExpression], '')
        ,[Level]                = R.[Level] + 1
    FROM ##TempStringsToClean TS
    INNER JOIN RecurCTE R ON TS.[ID] = R.[ID]
                            AND TS.[ExprNum] = R.[ExprNum]+1
    WHERE TS.[ExprNum] > 0
)


SELECT
     [ID]
    ,[TheString]
    ,[Replaced]     = STRING_AGG([SearchExpression], ', ') WITHIN GROUP (ORDER BY [ExprNum])
    ,[NewString]    = MAX([NewString])
    ,[Level]        = MAX([Level])
FROM RecurCTE R
GROUP BY
     [ID]
    ,[TheString]
ORDER BY [ID]
    ,[Level]
;

It produces the desired result: Image of Results

I know I could write a quick UFN to parse a list of strings and loop over the replaces, or worse yet a cursor, or dynamic SQL, or someone will say "use the Python xxxxxx lib!". The original question mentioned it is/was part of a string similarity, so I want to remove whole words. For anyone who might wind up in the future, I want to elaborate on @HABO's point:

Using the example strings:
'Mike was cleaning the lawnchair' <-- pretend there are more words matching between the two that make add to exactly 100 characters
'Mike was leaning on the lawnchair'

The [NewString] values would be 'c' and 'on' respectively. Equally weighting them we'll call it 98.5% similar. BUT, it's important to sort and replace the larger words, especially ones containing smaller words, first. Ie: replace 'cleaning', and then 'lean'.

Here's a couple more that also work, I thought nested replaces might save me the overhead of splitting 500K rows * their possible [search_expressions]:


/*
    intersect/except try
    Works but I don't appreciate the new string not being agged back into it's original position.
    Not a requirement, but it helps when validating.
*/
;WITH SplitOriginalStringCTE AS (
    SELECT
         TC.[ID]
        ,TC.[TheString]
        ,[Word]             = ORIGSTRING.[value]
    FROM ##TempStringsToClean TC
    CROSS APPLY string_split([TheString], ' ') ORIGSTRING
    WHERE TC.[ExprNum] = 0
)

,SearchExpressionsCTE AS (
    SELECT
         TC.[ID]
        ,TC.[TheString]
        ,[Word]             = TC.[SearchExpression]
    FROM ##TempStringsToClean TC
)


SELECT
     [ID]
    ,[TheString]
    ,[NewString]        = STRING_AGG([Word], ' ') --WITHIN GROUP (ORDER BY [Rn])
FROM (
    SELECT *
    FROM SplitOriginalStringCTE O
    
    EXCEPT
    
    SELECT *
    FROM SearchExpressionsCTE N
) X
GROUP BY
     [ID]
    ,[TheString]
ORDER BY
     [ID]
    ,[TheString]


/*
    Use cross apply.
*/
;WITH SplitOriginalStringCTE AS (
    SELECT
         TC.[ID]
        ,TC.[TheString]
        ,[Word]             = ORIGSTRING.[value]
        --SQL Server 2019, SSMS shows the optional [ordinal] column as an intellisense column, but doesn't exist in 2019, docs say only in Azure.
        --Might be included in SQL Server 2022.  Relying on chance to hope they're in the right order.
        ,[Rn]           = ROW_NUMBER() OVER (ORDER BY [ID])
    FROM ##TempStringsToClean TC
    CROSS APPLY string_split([TheString], ' ') ORIGSTRING
    WHERE TC.[ExprNum] = 0
)


SELECT
     [ID]
    ,[TheString]
    ,[NewString]        = STRING_AGG([Word], ' ') WITHIN GROUP (ORDER BY [Rn])
FROM SplitOriginalStringCTE OS
WHERE NOT EXISTS (
            SELECT 1
            FROM ##TempStringsToClean EX
            WHERE OS.[ID] = EX.[ID]
                AND OS.[Word] = EX.[SearchExpression]
    )
GROUP BY
     [ID]
    ,[TheString]
ORDER BY
     [ID]
;

Solution

  • You might be able to use the quirky update to do the nested replace, of course this might not work, the general idea:

    DECLARE @r nvarchar(max)
    
    SELECT  *
    , cast(NULL AS nvarchar(max)) AS str_new
    INTO #
    FROM    (
    VALUES(
    1, 'The Quick Brown Fox Jumps Over the Lazy Dog', 0, 'Brown')
    ,(1, 'The Quick Brown Fox Jumps Over the Lazy Dog', 1, 'Jump')
    ,(1, 'The Quick Brown Fox Jumps Over the Lazy Dog', 2, 'Over')
    ,(2, 'The Quick Brown Fox Jumps Over the Lazy Dog', 0, 'Quick')
    ,(2, 'The Quick Brown Fox Jumps Over the Lazy Dog', 1, 'F')
    ,(2, 'The Quick Brown Fox Jumps Over the Lazy Dog', 2, 'Lazy')
    ) x(id, str, sort, repl)
    
    UPDATE  t
    SET @r = replace(CASE WHEN sort = 0 THEN str ELSE @r END, repl, '')
    ,   str_new = @r
    FROM    # t
    

    Usually one need the table to be clusteredly indexed by the sort order you want to use, plus you need some luck, but you can read more about in on https://www.sqlservercentral.com/articles/solving-the-running-total-and-ordinal-rank-problems-rewritten