I have a column that contains license plates. The plate data somes from multiple sources and is not completely accurate. I must assume that zero's are misinterpreted as 'O' and vice versa. This includes multiple varying results in the same plate (OCR says 00OABC, when actual plate is 0O0ABC). To handle this, I need to build a where clause that uses all possible replacement values for every instance zero and the letter 'O'. How can this be done in tslq? Examples:
Plate, (where clause needed)
ABC012, (ABC012, ABCO12)
0OABC, (00ABC, OOABC, 0OABC, OOABC)
000XYZ, (000XYZ,00OXYZ,0OOXYZ,0O0XYZ,O00XYZ,O0OXYZ,OOOXYZ,OO0XYZ)
Edit: I need to build a where in clause, because I cannot do a replace in my select statement (example: select * from plates where replace(plate,'O','0') = '000XYZ'). I cannot do a replace because there is too much data. If this were a small table or if this were hadoop, it would be fine. What I have to work with is a MSSQL DB with a 20bn row table.
The solution to this problem revolves around defining character equivalences.
Just as case-insensitive compares treat uppercase letters as equivalent to the corresponding lowercase letters, and accent-insensitive comparisons will treat "à", "á", "â", and others as equivalent to "a", we need to define character equivalences for similar-looking characters.
Once we have done that, we can define functions for comparing plate numbers having similar appearances and to generate sets of similar appearing plates.
For comparisons, I believe that most case-insensitive and accent-insensitive compares will map equivalent characters to a preferred form (such as all accent-free lowercase) before doing the comparison. We can do the same by defining a translation of all alternate characters to their equivalent base or preferred forms before doing any comparison.
To generate sets of similar appearing plates, we can use a recursive CTE (common table expression) that steps through the plate number one character at a time, while mapping each character to all defined equivalent characters, including itself. Each character would be mapped independently, so the the final result includes all combinations (a cross product) of each character mapping.
I've put together some sample code as follows:
-- This table defines character mappings for similar looking characters.
-- The BaseChar value is considered the primary for each equivalence set,
-- while the AltChar values are the alternate characters that should be
-- considered to the base character and to any alternates sharing the same
-- base character.
CREATE TABLE PlateCharacterMap (
BaseChar CHAR, -- Source character from reader or database
AltChar CHAR -- Equivalent character for comparison purposes
)
INSERT PlateCharacterMap
VALUES
('O', '0'),
('I', '1'),
('S', '5'),
('O', 'Q')
-- Also map any defined base charcaters to themselves
-- to support later processing logic.
INSERT PlateCharacterMap
SELECT DISTINCT BaseChar, BaseChar
FROM PlateCharacterMap
To support direct compares between two potentially similar plate numbers, we need to define TranslateFrom
and TranslateTo
strings that can be used with the TRANSLATE()
function.
-- This single-row table is used to define translations for comparing any two
-- plate numbers. All alternate characters will be mapped to their associated
-- base characters.
CREATE TABLE PlateTranslateStrings (
TranslateFrom VARCHAR(1000),
TranslateTo VARCHAR(1000)
)
INSERT PlateTranslateStrings
SELECT
STRING_AGG(AltChar, '') WITHIN GROUP(ORDER BY AltChar) AS TranslateFrom,
STRING_AGG(BaseChar, '') WITHIN GROUP(ORDER BY AltChar) AS TranslateTo
FROM PlateCharacterMap
WHERE AltChar <> BaseChar
The above would calculate TranslateFrom = '015Q'
and TranslateTo = 'OISO'
.
Functions for (1) plate number mapping, (2) plate number comparison, and (3) plate number permutations follow:
-- Map a plate number to a string suitable for fuzzy comparisons
CREATE FUNCTION PlateCompareString(@PlateNumber VARCHAR(10))
RETURNS VARCHAR(10)
AS
BEGIN
RETURN (
SELECT TRANSLATE(@PlateNumber, T.TranslateFrom, T.TranslateTo)
FROM PlateTranslateStrings T
)
END
-- Perform a fuzzy comparison of plate numbers
CREATE FUNCTION PlateCompare(@Left VARCHAR(10), @Right VARCHAR(10))
RETURNS BIT
AS
BEGIN
RETURN CASE WHEN dbo.PlateCompareString(@Left) = dbo.PlateCompareString(@Right)
THEN 1 ELSE 0 END
END
-- Generate a set of similar plates for a given plate number.
CREATE FUNCTION PlatePermutations(@PlateNumber VARCHAR(10))
RETURNS TABLE
AS
RETURN (
WITH CTE_Permutate AS (
-- Anchor - Seed with empty result.
SELECT
CAST('' AS VARCHAR(10)) AS Result,
@PlateNumber AS Remaining
UNION ALL
-- Map next character to itself.
SELECT
CAST(P.Result + LEFT(P.Remaining, 1) AS VARCHAR(10)) AS Result,
STUFF(P.Remaining, 1, 1, '') AS Remaining
FROM CTE_Permutate P
WHERE LEN(P.Remaining) > 0
UNION ALL
-- Also map next character to all characters sharing the same base character
-- except itself.
SELECT
CAST(P.Result + ISNULL(M2.AltChar, LEFT(P.Remaining, 1)) AS VARCHAR(10)) AS Result,
STUFF(P.Remaining, 1, 1, '') AS Remaining
FROM CTE_Permutate P
JOIN PlateCharacterMap M1 ON M1.AltChar = LEFT(P.Remaining, 1)
JOIN PlateCharacterMap M2 ON M2.BaseChar = M1.BaseChar
WHERE LEN(P.Remaining) > 0
AND M2.AltChar <> M1.AltChar -- Exclude self (already included above)
)
SELECT P.Result AS PlatePermutation
FROM CTE_Permutate P
WHERE LEN(P.Remaining) = 0 -- Only include the completed mappings
)
Test code:
-- Plate comparison test
WITH CTE_TestData AS (
SELECT *
FROM (
VALUES
('ABC-XYZ'),
('ABC-OIS'), ('ABC-O15'), ('ABC-015'),
('PASSWORD'), ('PA55W0RD')
) V(PlateNumber)
)
SELECT
P1.PlateNumber AS PlateNumber1,
P2.PlateNumber AS PlateNumber2,
dbo.PlateCompareString(P1.PlateNumber) AS CompareString1,
dbo.PlateCompareString(P2.PlateNumber) AS CompareString2,
CASE WHEN dbo.PlateCompare(P1.PlateNumber, P2.PlateNumber) = 1
THEN 'Match' ELSE '' END AS Match
FROM CTE_TestData P1
JOIN CTE_TestData P2
ON P2.PlateNumber > P1.PlateNumber
ORDER BY P1.PlateNumber, P2.PlateNumber
Results:
PlateNumber1 | PlateNumber2 | CompareString1 | CompareString2 | Match |
---|---|---|---|---|
ABC-015 | ABC-O15 | ABC-OIS | ABC-OIS | Match |
ABC-015 | ABC-OIS | ABC-OIS | ABC-OIS | Match |
ABC-015 | ABC-XYZ | ABC-OIS | ABC-XYZ | |
ABC-015 | PA55W0RD | ABC-OIS | PASSWORD | |
ABC-015 | PASSWORD | ABC-OIS | PASSWORD | |
ABC-O15 | ABC-OIS | ABC-OIS | ABC-OIS | Match |
... | ... | ... | ... | ... |
PA55W0RD | PASSWORD | PASSWORD | PASSWORD | Match |
Plate permutation test:
-- Permutation demo
SELECT P.PlateNumber, PERM.PlatePermutation,
COUNT(*) OVER(PARTITION BY P.PlateNumber) AS PermutationCount
FROM (
VALUES
('A-ZZZ'), -- just this one
('B-1'), -- 2 permutations
('C-S'), -- 2 permutations
('D-Q'), -- 3 permutations
('E-15'), -- 2 x 2 = 4 permutations
('F-ISQ'), -- 2 x 2 x 3 = 12 permutations
('G-OQ0'), -- 3 x 3 x 3 = 27 permutations
('H-O0Q15') -- 3 x 3 x 3 x 2 x 2 = 108 permutations
) P(PlateNumber)
CROSS APPLY dbo.PlatePermutations(P.PlateNumber) PERM
ORDER BY P.PlateNumber, PERM.PlatePermutation
Partial results:
PlateNumber | PlatePermutation | PermutationCount |
---|---|---|
A-ZZZ | A-ZZZ | 1 |
B-1 | B-1 | 2 |
B-1 | B-I | 2 |
C-S | C-5 | 2 |
C-S | C-S | 2 |
D-Q | D-0 | 3 |
D-Q | D-O | 3 |
D-Q | D-Q | 3 |
E-15 | E-15 | 4 |
E-15 | E-1S | 4 |
E-15 | E-I5 | 4 |
E-15 | E-IS | 4 |
... | ... | ... |
As for a WHERE
clause for searches, you can use either the PlateCompare
or PlatePermutations
function. The former may be useful for individual compares, while the latter is better for searches
CREATE TABLE PlateDatabase (
PlateNumber VARCHAR(10)
)
INSERT PlateDatabase
VALUES
('ABC-XYZ'),
('ABC-OIS'), ('ABC-O15'), ('ABC-015'),
('PASSWORD'), ('PA55W0RD')
SELECT *
FROM PlateDatabase P
WHERE dbo.PlateCompare(P.PlateNumber, 'PASSW0RD') = 1
SELECT *
FROM PlateDatabase P
WHERE P.PlateNumber IN (SELECT * FROM dbo.PlatePermutations('ABC-OI5'))
SELECT P.*
FROM dbo.PlatePermutations('ABC-OI5') PERM
JOIN PlateDatabase P ON P.PlateNumber = PERM.PlatePermutation
I expect the latter would have the best performance, since the join will likely involve an index seek (assuming that PlateNumber
is indexed).
Results (the last queries above have the same results):
PlateNumber |
---|
PASSWORD |
PA55W0RD |
PlateNumber |
---|
ABC-OIS |
ABC-O15 |
ABC-015 |
See this db<>fiddle for a demo of everything discussed above.
This may not be the best and final answer, but should get you pretty close. I added a couple of extra equivalences to the initial mapping. You can of course tailor these mappings to your specifications.
SQL server databases typically are defined with a case-insensitive collation by default. Because I believe plate numbers typically only use uppercase letters, I did not go out of my way to override that default collation.