match words with few differences allowed

I was wondering is there is any tool to match almost the same word for a bash terminal.

In the following file, called list.txt contain 1 word per line:

ban
1ban
12ban
12ban3

It is easy to find words containing "ban"

grep -E "*ban*" list.txt

Question:

How to actually match words that are have x letters differences? With the search word "ban", I expect the match "1ban" for X=1.

Concerning the notion of distance, I want to have maximum: X deletion or X substitutions or X insertions

Any tool, but preferentially something you could call as command-line on a bash terminal.

NOTE: The Levenshtein Distance will count an insertion of 2 letter as 1 difference. This is not what I want.

Solution

You may use Python PyPi regex class that supports fuzzy matching.

Since you actually want to match words with maximum X difference (1 deletion OR 1 substitution OR 1 deletion), you may create a Python script like

#!/usr/bin/env python3
import regex, io, sys

def main(argv):
        if len(argv) < 3:
                # print("USAGE: fuzzy_search -searchword -xdiff -file")
                exit(-1)
        search=argv[0]
        xdiff=argv[1]
        file=argv[2]
        # print("Searching for {} in {} with {} differences...".format(search, file, xdiff))
        with open(file, "r") as f:
                contents = f.read()
                print(regex.findall(r"\b(?:{0}){{s<={1},i<={1},d<={1}}}\b".format(regex.escape(search), xdiff), contents))

if __name__ == "__main__":
        main(sys.argv[1:])

Here, {s<=1,i<=1,d<=1} means we allow the word we search for 1 or 0 substitutions (s<=1), 1 or 0 insertions (i<=1) or 1 or 0 deletions (d<=1).

The \b are word boundaries, thanks to that construct, only whole words are matched (no cat in vacation will get matched).

Save as fuzzy_search.py.

Then, you may call it as

python3 fuzzy_search.py "ban" 1 file

where "ban" is the word the fuzzy search is being performed for and 1 is the higher limit of differences.

The result I get is

['ban', '1ban']

You may change the format of the output to line only:

print("\n".join(regex.findall(r"\b(?:{0}){{s<={1},i<={1},d<={1}}}\b".format(regex.escape(search), xdiff), contents)))

Then, the result is

ban
1ban