Search code examples
sqlsql-serverpatindexredaction

How to identify and redact all instances of a matching pattern in T-SQL


I have a requirement to run a function over certain fields to identify and redact any numbers which are 5 digits or longer, ensuring all but the last 4 digits are replaced with *

For example: "Some text with 12345 and 1234 and 12345678" would become "Some text with *2345 and 1234 and ****5678"

I've used PATINDEX to identify the the starting character of the pattern:

PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', TEST_TEXT)

I can recursively call that to get the starting character of all the occurrences, but I'm struggling with the actual redaction.

Does anyone have any pointers on how this can be done? I know to use REPLACE to insert the *s where they need to be, it's just the identification of what I should actually be replacing I'm struggling with.

Could do it on a program, but I need it to be T-SQL (can be a function if needed).

Any tips greatly appreciated!


Solution

  • You can do this using the built in functions of SQL Server. All of which used in this example are present in SQL Server 2008 and higher.

    DECLARE @String VARCHAR(500) = 'Example Input: 1234567890, 1234, 12345, 123456, 1234567, 123asd456'
    DECLARE @StartPos INT = 1, @EndPos INT = 1;
    DECLARE @Input VARCHAR(500) = ISNULL(@String, '') + ' '; --Sets input field and adds a control character at the end to make the loop easier.
    DECLARE @OutputString VARCHAR(500) = ''; --Initalize an empty string to avoid string null errors
    
    WHILE (@StartPOS <> 0)
    BEGIN
        SET @StartPOS = PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', @Input);
        IF @StartPOS <> 0
        BEGIN
            SET @OutputString += SUBSTRING(@Input, 1, @StartPOS - 1); --Seperate all contents before the first occurance of our filter
            SET @Input = SUBSTRING(@Input, @StartPOS, 500); --Cut the entire string to the end. Last value must be greater than the original string length to simply cut it all.
    
            SET @EndPos = (PATINDEX('%[0-9][0-9][0-9][0-9][^0-9]%', @Input)); --First occurance of 4 numbers with a not number behind it.
            SET @Input = STUFF(@Input, 1, (@EndPos - 1), REPLICATE('*', (@EndPos - 1))); --@EndPos - 1 gives us the amount of chars we want to replace.
        END
    END
    SET @OutputString += @Input; --Append the last element
    
    SET @OutputString = LEFT(@OutputString, LEN(@OutputString))
    SELECT @OutputString;
    

    Which outputs the following:

    Example Input: ******7890, 1234, *2345, **3456, ***4567, 123asd456

    This entire code could also be made as a function since it only requires an input text.