Search code examples
sqlsql-servermaskingsql-masking

SQL SERVER generate data using Regex pattern


I would like to generate the data by given regex pattern in SQL Server. Is there any possibility to do? Say, I have pattern as below and I would like to generate data as follow:

The idea behind the concept is SQL STATIC DATA MASKING (which was removed in current feature). Our client wants to mask the production data in test database. We don't have SQL STATIC DATA MASKING feature with sql now but we have patterns to mask the column, so what I am thinking is, with these pattern we can run the update query.

SELECT "(\d){7}" AS RandonNumber, "(\W){5}" AS RandomString FROM tbl

Output Should be

  +---------------+--------------+
  |  RandonNumber | RandomString |
  +---------------+--------------+
  |  7894562      | AHJIL        |
  +---------------+--------------+
  |  9632587      | ZLOKP        |
  +---------------+--------------+
  |  4561238      | UJIOK        |
  +---------------+--------------+

Apart from this regular pattern, I have some customized pattern like Test_Product_(\d){1,4}, which should give result as below:

Test_Product_012 
Test_Product_143
Test_Product_8936

Complete Patterns which I am going to use for masking

Other Patterns                Samples
(\l){30}                      ahukoklijfahukokponmahukoahuko
(\d){7}                       7895623
(\W){5}                       ABCDEF
Test_Product_(\d){1,4}        Test_Product_007
0\.(\d){2}                    0.59
https://www\.(\l){10}\.com    https://www.anything.com

Solution

  • Well, I can give you a solution that is not based on regular expressions, but on a set of parameters - but it contains a complete set of all your requirements.
    I've based this solution on a user-defined function I've written to generate random strings (You can read my blog post about it here) - I've just changed it so that it could generate the mask you wanted based on the following conditions:

    • The mask has an optional prefix.
    • The mask has an optional suffix.
    • The mask has a variable-length random string.
    • The random string can contain either lower-case letters, upper-case letters, digits, or any combination of the above.

    I've decided these set of rules based on your update to the question, containing your desired masks:

    (\d){7}                       7895623
    (\W){5}                       ABCDEF
    Test_Product_(\d){1,4}        Test_Product_007
    0\.(\d){2}                    0.59
    https://www\.(\l){10}\.com    https://www.anything.com
    

    And now, for the code:
    Since I'm using a user-defined function, I can't use inside it the NewId() built in function - so we first need to create a view to generate the guid for us:

    CREATE VIEW GuidGenerator
    AS
        SELECT Newid() As NewGuid;
    

    In the function, we're going to use that view to generate a NewID() as the base of all randomness.

    The function itself is a lot more cumbersome then the random string generator I've started from:

    CREATE FUNCTION dbo.MaskGenerator
    (
        -- use null or an empty string for no prefix
        @Prefix nvarchar(4000), 
        -- use null or an empty string for no suffix
        @suffix nvarchar(4000), 
        -- the minimum length of the random part
        @MinLength int, 
        -- the maximum length of the random part
        @MaxLength int, 
        -- the maximum number of rows to return. Note: up to 1,000,000 rows
        @Count int, 
        -- 1, 2 and 4 stands for lower-case, upper-case and digits. 
        -- a bitwise combination of these values can be used to generate all possible combinations:
        -- 3: lower and upper, 5: lower and digis, 6: upper and digits, 7: lower, upper nad digits
        @CharType tinyint 
    )
    RETURNS TABLE
    AS 
    RETURN 
    -- An inline tally table with 1,000,000 rows
    WITH E1(N) AS (SELECT N FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10)) V(N)),   -- 10
         E2(N) AS (SELECT 1 FROM E1 a, E1 b), --100
         E3(N) AS (SELECT 1 FROM E2 a, E2 b), --10,000
         Tally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY @@SPID) FROM E3 a, E2 b) --1,000,000
    
    SELECT TOP(@Count) 
            n As Number, 
            CONCAT(@Prefix, (
            SELECT  TOP (Length) 
                    -- choose what char combination to use for the random part
                    CASE @CharType 
                        WHEN 1 THEN Lower
                        WHEN 2 THEN Upper
                        WHEN 3 THEN IIF(Rnd % 2 = 0, Lower, Upper)
                        WHEN 4 THEN Digit
                        WHEN 5 THEN IIF(Rnd % 2 = 0, Lower, Digit)
                        WHEN 6 THEN IIF(Rnd % 2 = 0, Upper, Digit)
                        WHEN 7 THEN 
                            CASE Rnd % 3
                                WHEN 0 THEN Lower
                                WHEN 1 THEN Upper
                                ELSE Digit
                            END
                    END
            FROM Tally As t0  
            -- create a random number from the guid using the GuidGenerator view
            CROSS APPLY (SELECT Abs(Checksum(NewGuid)) As Rnd FROM GuidGenerator) As rand
            CROSS APPLY
            (
                -- generate a random lower-case char, upper-case char and digit
                SELECT  CHAR(97 + Rnd % 26) As Lower, -- Random lower case letter
                        CHAR(65 + Rnd % 26) As Upper,-- Random upper case letter
                        CHAR(48 + Rnd % 10) As Digit -- Random digit
            ) As Chars
            WHERE  t0.n <> -t1.n -- Needed for the subquery to get re-evaluated for each row
            FOR XML PATH('') 
            ), @Suffix) As RandomString
    FROM Tally As t1
    CROSS APPLY
    (
        -- Select a random length between @MinLength and @MaxLength (inclusive)
        SELECT TOP 1 n As Length
        FROM Tally As t2
        CROSS JOIN GuidGenerator 
        WHERE t2.n >= @MinLength
        AND t2.n <= @MaxLength
        AND t2.n <> t1.n
        ORDER BY NewGuid
    ) As Lengths;
    

    And finally, Test cases:

    (\l){30} - ahukoklijfahukokponmahukoahuko

    SELECT RandomString FROM dbo.MaskGenerator(null, null, 30, 30, 2, 1); 
    

    Results:

    1, eyrutkzdugogyhxutcmcmplvzofser
    2, juuyvtzsvmmcdkngnzipvsepviepsp
    

    (\d){7} - 7895623

    SELECT RandomString FROM dbo.MaskGenerator(null, null, 7, 7, 2, 4); 
    

    Results:

    1, 8744412
    2, 2275313
    

    (\W){5} - ABCDE

    SELECT RandomString FROM dbo.MaskGenerator(null, null, 5, 5, 2, 2); 
    

    Results:

    1, RSYJE
    2, MMFAA
    

    Test_Product_(\d){1,4} - Test_Product_007

    SELECT RandomString FROM dbo.MaskGenerator('Test_Product_', null, 1, 4, 2, 4); 
    

    Results:

    1, Test_Product_933
    2, Test_Product_7
    

    0\.(\d){2} - 0.59

    SELECT RandomString FROM dbo.MaskGenerator('0.', null, 2, 2, 2, 4); 
    

    Results:

    1, 0.68
    2, 0.70
    

    https://www\.(\l){10}\.com - https://www.anything.com

    SELECT RandomString FROM dbo.MaskGenerator('https://www.', '.com', 10, 10, 2, 1); 
    

    Results:

    1, https://www.xayvkmkuci.com
    2, https://www.asbfcvomax.com       
    

    Here's how you use it to mask the content of a table:

    DECLARE @Count int = 10; 
    
    SELECT  CAST(IntVal.RandomString As Int) As IntColumn, 
            UpVal.RandomString as UpperCaseValue, 
            LowVal.RandomString as LowerCaseValue, 
            MixVal.RandomString as MixedValue,
            WithPrefix.RandomString As PrefixedValue
    FROM dbo.MaskGenerator(null, null, 3, 7, @Count, 4) As IntVal
    JOIN dbo.MaskGenerator(null, null, 10, 10, @Count, 1) As LowVal
        ON IntVal.Number = LowVal.Number
    JOIN dbo.MaskGenerator(null, null, 5, 10, @Count, 2) As UpVal
        ON IntVal.Number = UpVal.Number
    JOIN dbo.MaskGenerator(null, null, 10, 20, @Count, 7) As MixVal
        ON IntVal.Number = MixVal.Number
    JOIN dbo.MaskGenerator('Test ', null, 1, 4, @Count, 4) As WithPrefix
        ON IntVal.Number = WithPrefix.Number
    

    Results:

    IntColumn   UpperCaseValue  LowerCaseValue  MixedValue              PrefixedValue
    674         CCNVSDI         esjyyesesv      O2FAC7bfwg2Be5a91Q0     Test 4935
    30732       UJKSL           jktisddbnq      7o8B91Sg1qrIZSvG3AcL    Test 0
    4669472     HDLJNBWPJ       qgtfkjdyku      xUoLAZ4pAnpn            Test 8
    26347       DNAKERR         vlehbnampb      NBv08yJdKb75ybhaFqED    Test 91
    6084965     LJPMZMEU        ccigzyfwnf      MPxQ2t8jjmv0IT45yVcR    Test 4
    6619851     FEHKGHTUW       wswuefehsp      40n7Ttg7H5YtVPF         Test 848
    781         LRWKVDUV        bywoxqizju      UxIp2O4Jb82Ts           Test 6268
    52237       XXNPBL          beqxrgstdo      Uf9j7tCB4W2             Test 43
    876150      ZDRABW          fvvinypvqa      uo8zfRx07s6d0EP         Test 7
    

    Note that this is a fast process - generating 1000 rows with 5 columns took less than half a second on average in tests I've made.