Search code examples
pythonedit-distance

How to check if a similar sub-string has appeared in string with customise tolerance level


How to check whether a substirng is inside a string with specific edit distance tolerance. For example:

str = 'Python is a multi-paradigm, dynamically typed, multipurpose programming language, designed to be quick (to learn, to use, and to understand), and to enforce a clean and uniform syntax.'
substr1 = 'ython'
substr2 = 'thon'
substr3 = 'cython'
edit_distance_tolerance = 1

substr_in_str(str, substr1, edit_distance_tolerance)
>> True

substr_in_str(str, substr2, edit_distance_tolerance)
>> False

substr_in_str(str, substr3, edit_distance_tolerance)
>> True

What I tried: I tried to break the string in words and remove the special characters then do comparisons one by one but the performance(in terms of speed and accuracy) is not quite good.


Solution

  • The answer is not so simple as you think , and you will need a lot of mathematics to achieve this and standard re(regex) library can't solve this broblem . I think TRE library has solved this problem to a big extend , see here https://github.com/laurikari/tre/