Search code examples
regexlinuxbashterminalsimilarity

match words with few differences allowed


I was wondering is there is any tool to match almost the same word for a bash terminal.

In the following file, called list.txt contain 1 word per line:

ban
1ban
12ban
12ban3

It is easy to find words containing "ban"

grep -E "*ban*" list.txt

Question:

How to actually match words that are have x letters differences? With the search word "ban", I expect the match "1ban" for X=1.

Concerning the notion of distance, I want to have maximum: X deletion or X substitutions or X insertions

Any tool, but preferentially something you could call as command-line on a bash terminal.

NOTE: The Levenshtein Distance will count an insertion of 2 letter as 1 difference. This is not what I want.


Solution

  • You may use Python PyPi regex class that supports fuzzy matching.

    Since you actually want to match words with maximum X difference (1 deletion OR 1 substitution OR 1 deletion), you may create a Python script like

    #!/usr/bin/env python3
    import regex, io, sys
    
    def main(argv):
            if len(argv) < 3:
                    # print("USAGE: fuzzy_search -searchword -xdiff -file")
                    exit(-1)
            search=argv[0]
            xdiff=argv[1]
            file=argv[2]
            # print("Searching for {} in {} with {} differences...".format(search, file, xdiff))
            with open(file, "r") as f:
                    contents = f.read()
                    print(regex.findall(r"\b(?:{0}){{s<={1},i<={1},d<={1}}}\b".format(regex.escape(search), xdiff), contents))
    
    if __name__ == "__main__":
            main(sys.argv[1:])
    

    Here, {s<=1,i<=1,d<=1} means we allow the word we search for 1 or 0 substitutions (s<=1), 1 or 0 insertions (i<=1) or 1 or 0 deletions (d<=1).

    The \b are word boundaries, thanks to that construct, only whole words are matched (no cat in vacation will get matched).

    Save as fuzzy_search.py.

    Then, you may call it as

    python3 fuzzy_search.py "ban" 1 file
    

    where "ban" is the word the fuzzy search is being performed for and 1 is the higher limit of differences.

    The result I get is

    ['ban', '1ban']
    

    You may change the format of the output to line only:

    print("\n".join(regex.findall(r"\b(?:{0}){{s<={1},i<={1},d<={1}}}\b".format(regex.escape(search), xdiff), contents)))
    

    Then, the result is

    ban
    1ban