Is there a way to do fuzzy regex matching in Julia?
I have constructed the following regular expression test:
toMatch = Regex(word,"i")
ismatch(toMatch,input_string)
I would like to run this test while allowing some latitude in the match, specified as a maximum Levenshtein distance.
I have found the package Levenshtein which can calculate the distance but am not sure how to incorporate it into this logic. For example:
levenshtein("hello","hllo")
> 1
(This answer has nothing to do with regular expressions, but it covers some use cases.)
I'm not sure whether this fits your use case, but it looks like you are trying to find whether a word (or a close misspelling of it) occurs in your text. If the text is separated by spaces and your word does not contain spaces, you could try something like:
nopunct(s) = filter(c -> !ispunct(c), s)
nfcl(s) = normalize_string(s, decompose=true, compat=true, casefold=true,
                           stripmark=true, stripignore=true)
canonicalize(s) = nopunct(nfcl(s))
fuzzy(needle, haystack, n) = any(
    w -> levenshtein(w, canonicalize(needle)) < n,
    split(canonicalize(haystack)))
What this does is, roughly:
nfcl normalizes strings with similar "human" appearances by stripping accents, ignoring case, and performing Unicode normalization. This is pretty useful for fuzzy matching:
julia> nfcl("Ce texte est en français.")
"ce texte est en francais."
nopunct strips punctuation characters, further simplifying the string.
julia> nopunct("Hello, World!")
"Hello World"
canonicalize simply combines these two transformations.
Then we check whether any word in the canonicalized haystack (split on whitespace) has a Levenshtein distance strictly less than n from the canonicalized needle.
Examples:
julia> fuzzy("Robert", "My name is robrt.", 2)
true
julia> fuzzy("Robert", "My name is john.", 2)
false
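On Julia 1.x, normalize_string is available as Unicode.normalize from the Unicode standard library. The sketch below restates the approach above in that form so it can be tried without extra packages. Note the assumptions: lev is a hypothetical stand-in for the package's levenshtein, written here as a plain dynamic-programming implementation, and computing the canonicalized needle once before the loop is a small change of mine rather than part of the code above.

```julia
using Unicode  # standard library; provides Unicode.normalize

# Hypothetical stand-in for the package's `levenshtein`:
# classic dynamic-programming edit distance over characters.
function lev(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    prev = collect(0:length(t))      # distances from the empty prefix of `a`
    for i in 1:length(s)
        cur = similar(prev)
        cur[1] = i                   # distance from s[1:i] to the empty string
        for j in 1:length(t)
            cost = s[i] == t[j] ? 0 : 1
            cur[j + 1] = min(prev[j + 1] + 1,   # deletion
                             cur[j] + 1,        # insertion
                             prev[j] + cost)    # substitution
        end
        prev = cur
    end
    return prev[end]
end

# Same helpers as above, using the Julia 1.x name for normalization.
nopunct(s) = filter(c -> !ispunct(c), s)
nfcl(s) = Unicode.normalize(s, decompose=true, compat=true, casefold=true,
                            stripmark=true, stripignore=true)
canonicalize(s) = nopunct(nfcl(s))

# True if any whitespace-separated word of `haystack` is at distance < n
# from `needle` after canonicalization.
function fuzzy(needle, haystack, n)
    target = canonicalize(needle)    # computed once instead of per word
    return any(w -> lev(w, target) < n, split(canonicalize(haystack)))
end
```

Because the comparison is strict, calling fuzzy with n = 2 accepts words that are at most one edit away from the needle.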
This is by no means a complete solution, but it covers a lot of common use cases. Anything more advanced, such as needles that contain spaces or fuzzy matching inside a regular expression, calls for a deeper look into the subject.