Tags: string, optimization, nlp, data-analysis, levenshtein-distance

How to find strings of a list in a text with typos


I'm trying to check whether any strings from a list appear in a given text. But the given text can contain typos. For example:

text: The brownw focx and the cat are in th eforest.
list: [brown fox, forest, cat]

What I currently do is split my text into groups of one and two words, like so: [The, brownw, focx, and, the, cat, are, in, th, eforest, The brownw, brownw focx, focx and, and the, the cat, cat are, are in, in th, th eforest]

Then I iterate over each group of words and use the Levenshtein algorithm to check how closely the two strings match. If the similarity is more than 90%, I consider them the same.
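For reference, this approach can be sketched in Python. The standard library has no Levenshtein function, so `difflib.SequenceMatcher`'s ratio stands in for the Levenshtein similarity here; the function name and threshold are illustrative, but the structure (one- and two-word candidates, pairwise similarity check) is the one described above:

```python
from difflib import SequenceMatcher

def find_fuzzy_matches(text, phrases, threshold=0.9):
    """Return (candidate, phrase) pairs whose similarity reaches the threshold."""
    words = [w.strip(".,") for w in text.split()]
    # Build candidate groups of one and two words, as described above.
    candidates = words + [" ".join(pair) for pair in zip(words, words[1:])]
    return [(candidate, phrase)
            for candidate in candidates
            for phrase in phrases
            if SequenceMatcher(None, candidate.lower(), phrase).ratio() >= threshold]

text = "The brownw focx and the cat are in th eforest."
print(find_fuzzy_matches(text, ["brown fox", "forest", "cat"]))
```

Note that every candidate is compared against every phrase, which is why the runtime grows quickly with the length of the text. This is exactly the cost the answer below tries to reduce.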

This approach, however, is very time-consuming, and I wonder whether there is a faster alternative.


Solution

  • Instead of computing the full Levenshtein distance (which is slow), you could run a couple of cheap sanity checks beforehand to exclude candidates that are obviously wrong:

    1. Word length: the will never match brown fox, as it is far too short. Count the characters, and exclude all candidates that are more than a few letters shorter or longer than the target.
    2. Letters: check which letters occur in the word. For example, the does not contain a single letter from fox, so you can rule it out straight away. With short words this might not make a big difference in performance, but with longer words it will. As an additional optimisation, look for rare characters (x, q, w) first, or simply ignore common ones (e, t, s), which are likely to be present anyway.

    Heuristics such as these will of course not give you the right answer on their own, but they can filter out candidates that are definitely not going to match. You then only need to perform the more expensive full check on a much smaller number of candidate words.
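    A minimal sketch of such a prefilter, assuming illustrative names and thresholds (the answer does not prescribe exact cutoffs):

```python
def plausible(candidate, target, max_len_diff=3, min_common=0.5):
    """Cheap checks that rule out obvious non-matches before the
    expensive edit-distance computation."""
    # 1. Length: strings whose lengths differ by too much cannot match.
    if abs(len(candidate) - len(target)) > max_len_diff:
        return False
    # 2. Letters: require that a fair share of the target's distinct
    #    letters actually occur in the candidate.
    common = set(candidate) & set(target)
    return len(common) / len(set(target)) >= min_common

# 'the' is far too short for 'brown fox', so the length check rejects it;
# 'the' vs 'fox' passes the length check but shares no letters;
# 'focx' vs 'fox' survives both checks and goes on to the full comparison.
print(plausible("the", "brown fox"))  # False
print(plausible("the", "fox"))        # False
print(plausible("focx", "fox"))       # True
```

Only the candidates for which `plausible` returns True need the full Levenshtein comparison, which is where the time saving comes from.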