Search code examples
t-sql

How can you generate possible permutations from a given string in tsql where you need multiple replace() calls for specific characters?


I have a column that contains license plates. The plate data somes from multiple sources and is not completely accurate. I must assume that zero's are misinterpreted as 'O' and vice versa. This includes multiple varying results in the same plate (OCR says 00OABC, when actual plate is 0O0ABC). To handle this, I need to build a where clause that uses all possible replacement values for every instance zero and the letter 'O'. How can this be done in tslq? Examples:

Plate, (where clause needed)

ABC012, (ABC012, ABCO12)
0OABC, (00ABC, OOABC, 0OABC, OOABC)
000XYZ, (000XYZ,00OXYZ,0OOXYZ,0O0XYZ,O00XYZ,O0OXYZ,OOOXYZ,OO0XYZ)

Edit: I need to build a where in clause, because I cannot do a replace in my select statement (example: select * from plates where replace(plate,'O','0') = '000XYZ'). I cannot do a replace because there is too much data. If this were a small table or if this were hadoop, it would be fine. What I have to work with is a MSSQL DB with a 20bn row table.


Solution

  • The solution to this problem revolves around defining character equivalences.

    Just as case-insensitive compares treat uppercase letters as equivalent to the corresponding lowercase letters, and accent-insensitive comparisons will treat "à", "á", "â", and others as equivalent to "a", we need to define character equivalences for similar-looking characters.

    Once we have done that, we can define functions for comparing plate numbers having similar appearances and to generate sets of similar appearing plates.

    For comparisons, I believe that most case-insensitive and accent-insensitive compares will map equivalent characters to a preferred form (such as all accent-free lowercase) before doing the comparison. We can do the same by defining a translation of all alternate characters to their equivalent base or preferred forms before doing any comparison.

    To generate sets of similar appearing plates, we can use a recursive CTE (common table expression) that steps through the plate number one character at a time, while mapping each character to all defined equivalent characters, including itself. Each character would be mapped independently, so the the final result includes all combinations (a cross product) of each character mapping.

    I've put together some sample code as follows:

    -- This table defines character mappings for similar looking characters.
    -- The BaseChar value is considered the primary for each equivalence set,
    -- while the AltChar values are the alternate characters that should be
    -- considered to the base character and to any alternates sharing the same
    -- base character.
    CREATE TABLE PlateCharacterMap (
        BaseChar CHAR,  -- Source character from reader or database
        AltChar CHAR     -- Equivalent character for comparison purposes
    )
    INSERT PlateCharacterMap
    VALUES
        ('O', '0'),
        ('I', '1'),
        ('S', '5'),
        ('O', 'Q')
    
    -- Also map any defined base charcaters to themselves
    -- to support later processing logic.
    INSERT PlateCharacterMap
    SELECT DISTINCT BaseChar, BaseChar
    FROM PlateCharacterMap
    

    To support direct compares between two potentially similar plate numbers, we need to define TranslateFrom and TranslateTo strings that can be used with the TRANSLATE() function.

    -- This single-row table is used to define translations for comparing any two
    -- plate numbers. All alternate characters will be mapped to their associated
    -- base characters.
    CREATE TABLE PlateTranslateStrings (
        TranslateFrom VARCHAR(1000),
        TranslateTo VARCHAR(1000)
    )
    INSERT PlateTranslateStrings
    SELECT
        STRING_AGG(AltChar, '') WITHIN GROUP(ORDER BY AltChar) AS TranslateFrom,
        STRING_AGG(BaseChar, '') WITHIN GROUP(ORDER BY AltChar) AS TranslateTo
    FROM PlateCharacterMap
    WHERE AltChar <> BaseChar
    

    The above would calculate TranslateFrom = '015Q' and TranslateTo = 'OISO'.

    Functions for (1) plate number mapping, (2) plate number comparison, and (3) plate number permutations follow:

    -- Map a plate number to a string suitable for fuzzy comparisons
    CREATE FUNCTION PlateCompareString(@PlateNumber VARCHAR(10))
    RETURNS VARCHAR(10)
    AS
    BEGIN
        RETURN (
            SELECT TRANSLATE(@PlateNumber, T.TranslateFrom, T.TranslateTo)
            FROM PlateTranslateStrings T
        )
    END
    
    -- Perform a fuzzy comparison of plate numbers
    CREATE FUNCTION PlateCompare(@Left VARCHAR(10), @Right VARCHAR(10))
    RETURNS BIT
    AS
    BEGIN
        RETURN CASE WHEN dbo.PlateCompareString(@Left) = dbo.PlateCompareString(@Right)
               THEN 1 ELSE 0 END
    END
    
    -- Generate a set of similar plates for a given plate number.
    CREATE FUNCTION PlatePermutations(@PlateNumber VARCHAR(10))
    RETURNS TABLE
    AS
    RETURN (
        WITH CTE_Permutate AS (
            -- Anchor - Seed with empty result.
            SELECT
                CAST('' AS VARCHAR(10)) AS Result,
                @PlateNumber AS Remaining
            UNION ALL
            -- Map next character to itself.
            SELECT
                CAST(P.Result + LEFT(P.Remaining, 1) AS VARCHAR(10)) AS Result,
                STUFF(P.Remaining, 1, 1, '') AS Remaining
            FROM CTE_Permutate P
            WHERE LEN(P.Remaining) > 0
            UNION ALL
            -- Also map next character to all characters sharing the same base character
            -- except itself.
            SELECT
                CAST(P.Result + ISNULL(M2.AltChar, LEFT(P.Remaining, 1)) AS VARCHAR(10)) AS Result,
                STUFF(P.Remaining, 1, 1, '') AS Remaining
            FROM CTE_Permutate P
            JOIN PlateCharacterMap M1 ON M1.AltChar = LEFT(P.Remaining, 1)
            JOIN PlateCharacterMap M2 ON M2.BaseChar = M1.BaseChar
            WHERE LEN(P.Remaining) > 0
            AND M2.AltChar <> M1.AltChar -- Exclude self (already included above)
        )
        SELECT P.Result AS PlatePermutation
        FROM CTE_Permutate P
        WHERE LEN(P.Remaining) = 0  -- Only include the completed mappings
    )
    

    Test code:

    -- Plate comparison test
    WITH CTE_TestData AS (
        SELECT *
        FROM (
            VALUES
                ('ABC-XYZ'),
                ('ABC-OIS'), ('ABC-O15'), ('ABC-015'),
                ('PASSWORD'), ('PA55W0RD')
        ) V(PlateNumber)
    )
    SELECT
        P1.PlateNumber AS PlateNumber1,
        P2.PlateNumber AS PlateNumber2,
        dbo.PlateCompareString(P1.PlateNumber) AS CompareString1,
        dbo.PlateCompareString(P2.PlateNumber) AS CompareString2,
        CASE WHEN dbo.PlateCompare(P1.PlateNumber, P2.PlateNumber) = 1
             THEN 'Match' ELSE '' END AS Match 
    FROM CTE_TestData P1
    JOIN CTE_TestData P2
        ON P2.PlateNumber > P1.PlateNumber
    ORDER BY P1.PlateNumber, P2.PlateNumber
    

    Results:

    PlateNumber1 PlateNumber2 CompareString1 CompareString2 Match
    ABC-015 ABC-O15 ABC-OIS ABC-OIS Match
    ABC-015 ABC-OIS ABC-OIS ABC-OIS Match
    ABC-015 ABC-XYZ ABC-OIS ABC-XYZ
    ABC-015 PA55W0RD ABC-OIS PASSWORD
    ABC-015 PASSWORD ABC-OIS PASSWORD
    ABC-O15 ABC-OIS ABC-OIS ABC-OIS Match
    ... ... ... ... ...
    PA55W0RD PASSWORD PASSWORD PASSWORD Match

    Plate permutation test:

    -- Permutation demo
    SELECT P.PlateNumber, PERM.PlatePermutation,
        COUNT(*) OVER(PARTITION BY P.PlateNumber) AS PermutationCount
    FROM (
        VALUES
            ('A-ZZZ'),  -- just this one
            ('B-1'),    -- 2 permutations
            ('C-S'),    -- 2 permutations
            ('D-Q'),    -- 3 permutations
            ('E-15'),   -- 2 x 2 = 4 permutations
            ('F-ISQ'),  -- 2 x 2 x 3 = 12 permutations
            ('G-OQ0'),  -- 3 x 3 x 3 = 27 permutations
            ('H-O0Q15') -- 3 x 3 x 3 x 2 x 2 = 108 permutations
    ) P(PlateNumber)
    CROSS APPLY dbo.PlatePermutations(P.PlateNumber) PERM
    ORDER BY P.PlateNumber, PERM.PlatePermutation
    

    Partial results:

    PlateNumber PlatePermutation PermutationCount
    A-ZZZ A-ZZZ 1
    B-1 B-1 2
    B-1 B-I 2
    C-S C-5 2
    C-S C-S 2
    D-Q D-0 3
    D-Q D-O 3
    D-Q D-Q 3
    E-15 E-15 4
    E-15 E-1S 4
    E-15 E-I5 4
    E-15 E-IS 4
    ... ... ...

    As for a WHERE clause for searches, you can use either the PlateCompare or PlatePermutations function. The former may be useful for individual compares, while the latter is better for searches

    CREATE TABLE PlateDatabase (
        PlateNumber VARCHAR(10)
    )
    INSERT PlateDatabase
    VALUES
        ('ABC-XYZ'),
        ('ABC-OIS'), ('ABC-O15'), ('ABC-015'),
        ('PASSWORD'), ('PA55W0RD')
    
    SELECT *
    FROM PlateDatabase P
    WHERE dbo.PlateCompare(P.PlateNumber, 'PASSW0RD') = 1
    
    SELECT *
    FROM PlateDatabase P
    WHERE P.PlateNumber IN (SELECT * FROM dbo.PlatePermutations('ABC-OI5'))
    
    SELECT P.*
    FROM dbo.PlatePermutations('ABC-OI5') PERM
    JOIN PlateDatabase P ON P.PlateNumber = PERM.PlatePermutation
    

    I expect the latter would have the best performance, since the join will likely involve an index seek (assuming that PlateNumber is indexed).

    Results (the last queries above have the same results):

    PlateNumber
    PASSWORD
    PA55W0RD
    PlateNumber
    ABC-OIS
    ABC-O15
    ABC-015

    See this db<>fiddle for a demo of everything discussed above.

    This may not be the best and final answer, but should get you pretty close. I added a couple of extra equivalences to the initial mapping. You can of course tailor these mappings to your specifications.

    SQL server databases typically are defined with a case-insensitive collation by default. Because I believe plate numbers typically only use uppercase letters, I did not go out of my way to override that default collation.