Is it possible to recursively call the REPLACE function, without recursion, looping, dynamic SQL, UFNs?
Update: Re-wrote the entire question.
I'd like to be able to select a string with all the [search_expression] words removed in one quick [preferably set-based] action.
This works fine here, but if I have 500K rows it'll be a different story:
IF OBJECT_ID('TempDB..##TempStringsToClean') IS NOT NULL
DROP TABLE ##TempStringsToClean;
GO
CREATE TABLE ##TempStringsToClean (
[ID] INT
,[TheString] VARCHAR(128)
,[ExprNum] INT
,[SearchExpression] VARCHAR(64)
);
INSERT INTO ##TempStringsToClean
SELECT *
FROM (VALUES
(1, 'The Quick Brown Fox Jumps Over the Lazy Dog', 0, 'Brown' )
,(1, 'The Quick Brown Fox Jumps Over the Lazy Dog', 1, 'Jump' )
,(1, 'The Quick Brown Fox Jumps Over the Lazy Dog', 2, 'Over' )
,(2, 'The Quick Brown Fox Jumps Over the Lazy Dog', 0, 'Quick' )
,(2, 'The Quick Brown Fox Jumps Over the Lazy Dog', 1, 'F' )
,(2, 'The Quick Brown Fox Jumps Over the Lazy Dog', 2, 'Lazy' )
,(3, 'Order of operations is important for the Business Unit "BU".', 0, 'BU')
,(3, 'Order of operations is important for the Business Unit "BU".', 1, 'Business Unit')
,(4, 'Order of operations is important for the Business Unit "BU".', 0, 'Business Unit')
,(4, 'Order of operations is important for the Business Unit "BU".', 1, 'BU')
) AS VALUE([ID],[TheString],[ExprNum],[SearchExpression])
;
;WITH RecurCTE AS (
SELECT
TS.[ID]
,TS.[TheString]
,TS.[ExprNum]
,TS.[SearchExpression]
,[NewString] = REPLACE(TS.[TheString], TS.[SearchExpression], '')
,[Level] = 1
FROM ##TempStringsToClean TS
WHERE [ExprNum] = 0
UNION ALL
SELECT
TS.[ID]
,TS.[TheString]
,TS.[ExprNum]
,TS.[SearchExpression]
,[NewString] = REPLACE(R.[NewString], TS.[SearchExpression], '')
,[Level] = R.[Level] + 1
FROM ##TempStringsToClean TS
INNER JOIN RecurCTE R ON TS.[ID] = R.[ID]
AND TS.[ExprNum] = R.[ExprNum]+1
WHERE TS.[ExprNum] > 0
)
SELECT
[ID]
,[TheString]
,[Replaced] = STRING_AGG([SearchExpression], ', ') WITHIN GROUP (ORDER BY [ExprNum])
,[NewString] = MAX([NewString])
,[Level] = MAX([Level])
FROM RecurCTE R
GROUP BY
[ID]
,[TheString]
ORDER BY [ID]
,[Level]
;
It produces the desired result:
I know I could write a quick UFN to parse a list of strings and loop over the replaces, or worse yet a cursor, or dynamic SQL, or someone will say "use the Python xxxxxx lib!". The original question mentioned it is/was part of a string similarity, so I want to remove whole words. For anyone who might wind up in the future, I want to elaborate on @HABO's point:
Using the example strings:
'Mike was cleaning the lawnchair' <-- pretend there are more words matching between the two that make add to exactly 100 characters
'Mike was leaning on the lawnchair'
The [NewString] values would be 'c' and 'on' respectively. Equally weighting them we'll call it 98.5% similar. BUT, it's important to sort and replace the larger words, especially ones containing smaller words, first. Ie: replace 'cleaning', and then 'lean'.
Here's a couple more that also work, I thought nested replaces might save me the overhead of splitting 500K rows * their possible [search_expressions]:
/*
intersect/except try
Works but I don't appreciate the new string not being agged back into it's original position.
Not a requirement, but it helps when validating.
*/
;WITH SplitOriginalStringCTE AS (
SELECT
TC.[ID]
,TC.[TheString]
,[Word] = ORIGSTRING.[value]
FROM ##TempStringsToClean TC
CROSS APPLY string_split([TheString], ' ') ORIGSTRING
WHERE TC.[ExprNum] = 0
)
,SearchExpressionsCTE AS (
SELECT
TC.[ID]
,TC.[TheString]
,[Word] = TC.[SearchExpression]
FROM ##TempStringsToClean TC
)
SELECT
[ID]
,[TheString]
,[NewString] = STRING_AGG([Word], ' ') --WITHIN GROUP (ORDER BY [Rn])
FROM (
SELECT *
FROM SplitOriginalStringCTE O
EXCEPT
SELECT *
FROM SearchExpressionsCTE N
) X
GROUP BY
[ID]
,[TheString]
ORDER BY
[ID]
,[TheString]
/*
Use cross apply.
*/
;WITH SplitOriginalStringCTE AS (
SELECT
TC.[ID]
,TC.[TheString]
,[Word] = ORIGSTRING.[value]
--SQL Server 2019, SSMS shows the optional [ordinal] column as an intellisense column, but doesn't exist in 2019, docs say only in Azure.
--Might be included in SQL Server 2022. Relying on chance to hope they're in the right order.
,[Rn] = ROW_NUMBER() OVER (ORDER BY [ID])
FROM ##TempStringsToClean TC
CROSS APPLY string_split([TheString], ' ') ORIGSTRING
WHERE TC.[ExprNum] = 0
)
SELECT
[ID]
,[TheString]
,[NewString] = STRING_AGG([Word], ' ') WITHIN GROUP (ORDER BY [Rn])
FROM SplitOriginalStringCTE OS
WHERE NOT EXISTS (
SELECT 1
FROM ##TempStringsToClean EX
WHERE OS.[ID] = EX.[ID]
AND OS.[Word] = EX.[SearchExpression]
)
GROUP BY
[ID]
,[TheString]
ORDER BY
[ID]
;
You might be able to use the quirky update to do the nested replace, of course this might not work, the general idea:
DECLARE @r nvarchar(max)
SELECT *
, cast(NULL AS nvarchar(max)) AS str_new
INTO #
FROM (
VALUES(
1, 'The Quick Brown Fox Jumps Over the Lazy Dog', 0, 'Brown')
,(1, 'The Quick Brown Fox Jumps Over the Lazy Dog', 1, 'Jump')
,(1, 'The Quick Brown Fox Jumps Over the Lazy Dog', 2, 'Over')
,(2, 'The Quick Brown Fox Jumps Over the Lazy Dog', 0, 'Quick')
,(2, 'The Quick Brown Fox Jumps Over the Lazy Dog', 1, 'F')
,(2, 'The Quick Brown Fox Jumps Over the Lazy Dog', 2, 'Lazy')
) x(id, str, sort, repl)
UPDATE t
SET @r = replace(CASE WHEN sort = 0 THEN str ELSE @r END, repl, '')
, str_new = @r
FROM # t
Usually one need the table to be clusteredly indexed by the sort order you want to use, plus you need some luck, but you can read more about in on https://www.sqlservercentral.com/articles/solving-the-running-total-and-ordinal-rank-problems-rewritten