Fuzzy regex matching in Julia


Is there a way to do fuzzy regex matching in Julia?

I have constructed the following regular expression test:

toMatch = Regex(word, "i")
occursin(toMatch, input_string)

I would like to run this test while allowing some latitude in the match, specified as a maximum Levenshtein distance.

I have found the Levenshtein package, which can calculate the distance, but I am not sure how to incorporate it into this logic. For example:

levenshtein("hello", "hllo")
> 1

Solution

  • (This answer has nothing to do with regular expressions, but it covers some use cases.)

    I'm not sure whether this fits your use case, but it looks like you are trying to find out whether a word (or a close misspelling of it) appears in your text. If the text is separated by spaces and your word contains no spaces, you could try something like:

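    using Unicode      # Julia 1.0+: Unicode.normalize (Base.normalize_string on 0.6 and earlier)
    using Levenshtein  # provides levenshtein

    # Remove punctuation characters.
    nopunct(s) = filter(c -> !ispunct(c), s)
    # Fold case, strip accents, and apply Unicode compatibility decomposition.
    nfcl(s) = Unicode.normalize(s, decompose=true, compat=true, casefold=true,
                                stripmark=true, stripignore=true)
    canonicalize(s) = nopunct(nfcl(s))
    # True if any whitespace-separated word of haystack is at distance < n from needle.
    fuzzy(needle, haystack, n) = any(
        w -> levenshtein(w, canonicalize(needle)) < n,
        split(canonicalize(haystack)))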
    

    What this does is, roughly:

    nfcl maps strings that look alike to a human reader onto the same form by stripping accents, folding case, and performing Unicode normalization. This is useful for fuzzy matching:

    julia> nfcl("Ce texte est en français.")
    "ce texte est en francais."
    

    nopunct strips punctuation characters, further simplifying the string.

    julia> nopunct("Hello, World!")
    "Hello World"
    

    canonicalize simply combines these two transformations.

    Then we check whether any of the words in the canonicalized haystack (split on whitespace) is at a Levenshtein distance of less than n from the canonicalized needle.
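
    For readers who want to see what that distance measures, here is a minimal sketch of the standard dynamic-programming computation. The name lev_distance is hypothetical and is only a stand-in for illustration; it is not the Levenshtein package's implementation, which may differ.

    ```julia
    # Hypothetical stand-in for the package's levenshtein, for illustration only.
    function lev_distance(a::AbstractString, b::AbstractString)
        s, t = collect(a), collect(b)        # compare characters, not bytes
        prev = collect(0:length(t))          # distances from "" to each prefix of t
        for i in 1:length(s)
            cur = similar(prev)
            cur[1] = i                       # delete the first i characters of s
            for j in 1:length(t)
                cost = s[i] == t[j] ? 0 : 1
                cur[j + 1] = min(prev[j + 1] + 1,   # deletion
                                 cur[j] + 1,        # insertion
                                 prev[j] + cost)    # substitution or match
            end
            prev = cur
        end
        return prev[end]
    end
    ```

    With this stand-in, lev_distance("hello", "hllo") gives 1, which agrees with the example in the question.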

    Examples:

    julia> fuzzy("Robert", "My name is robrt.", 2)
    true
    
    julia> fuzzy("Robert", "My name is john.", 2)
    false
    

    This is by no means a complete solution: it compares the needle only against individual whitespace-separated words. It does cover many common use cases; for more advanced ones, you should look into approximate string matching in more depth.
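
    When choosing a threshold n, it can help to see which word is closest and how far away it is. The following is a rough sketch under stated assumptions: closest_word, canon, and the dist argument are hypothetical names, Julia 1.0+ is assumed for Unicode.normalize, and dist can be any two-argument distance function, such as levenshtein from the package.

    ```julia
    using Unicode  # standard library; provides Unicode.normalize on Julia 1.0+

    # Same idea as canonicalize: fold case, strip accents, drop punctuation.
    canon(s) = filter(c -> !ispunct(c),
                      Unicode.normalize(s, decompose=true, compat=true, casefold=true,
                                        stripmark=true, stripignore=true))

    # Return (distance, word) for the haystack word closest to the needle,
    # or `nothing` if the haystack contains no words.
    # `dist` is an assumed two-argument distance function supplied by the caller.
    function closest_word(dist, needle, haystack)
        target = canon(needle)
        words = split(canon(haystack))
        isempty(words) && return nothing
        # Tuples compare by distance first, then by word to break ties.
        return minimum(w -> (dist(w, target), w), words)
    end
    ```

    For example, closest_word(levenshtein, "Robert", "My name is robrt.") would report the word "robrt" together with its distance, which you can then compare against candidate values of n.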