algorithm string-matching levenshtein-distance

Check if one string includes a substring with Levenshtein distance of 1 from other string

My problem is that we want our users to enter the code like this: 639195-EM-66-XA-53-WX somewhere in the input, so the result may look like this: The code is 639195-EM-66-XA-53-WX, let me in. We still want to match the string if they make a small error in the code (Levenshtein distance of 1). For example The code is 739195-EM-66-XA-53-WX, let me in. (changed 6 to 7 in the first letter of the code)

The algorithm should match even if user skips dashes, and it should ignore lowercase/uppercase letters. These requirements are easy to fulfil, because I can remove all dashes and do to_uppercase.

Is there an algorithm for something like that?

Generating all strings with the distance of 1 from original code is computationally expensive.

I was also thinking about using something like Levenshtein distance, but ignoring missing letters that user added in the second string, but that would allow wrong letters in the middle of the code.

Searching for the code in user input seems a little bit better, but still not very clean.

Solution

I had an idea for a solution, maybe this is good enough for you:

As you said, first remove the dashes and make everything upper (or lower) case:

Sentence: THE CODE IS 639195EM66XA53WX, LET ME IN

Code: 639195EM66XA53WX

Split the code in the middle (c1 and c2), because Levenshtein distance of 1 means that there can only be one mistake (insertion, deletion or replacement of a single character), so one of c1 or c2 has to match if the code is present in the sentence with just 1 or less mistakes. Splitting in the middle because the longer both substrings of the code are the fewer matches you should get:

c1: 639195EM

c2: 66XA53WX

Now try to find c1 and c2 in your sentence, if you find a match then you either have to go forward (c1 matched) or backwards (c2 matched) in the sentence to check if the Levenshtein distance of the missing part is 1 or less.

So in your example you would find c2 and then:

Set pointers to the last character of c1 and the character before the match.
While the characters are the same reduce both pointers by 1 (go backwards in both strings).
If you can consume c1 completely this way you found an exact match (Levenshtein distance of 0).
Otherwise try the 3 possibilities for Levenshtein distance of 1:
1. Only move the pointer of the c1 backwards and see if the rest matches (deletion).
2. Only move the pointer of the sentence backwards and see if the rest matches (insertion).
3. Move both pointers backwards and see if the rest matches (replacement).
If one of them succeeds you found a match with Levenshtein distance of 1, otherwise the distance is higher.