I was wondering if there is any tool to match almost-identical words in a bash terminal.
The following file, called list.txt, contains one word per line:
ban
1ban
12ban
12ban3
It is easy to find words containing "ban":
grep "ban" list.txt
Question:
How can I match words that differ by X letters? With the search word "ban", I expect "1ban" to match for X=1.
Concerning the notion of distance, I want to allow at most X deletions, or X substitutions, or X insertions.
Any tool is fine, but preferably something that can be called from the command line in a bash terminal.
NOTE: The Levenshtein Distance will count an insertion of 2 letter as 1 difference. This is not what I want.
You may use the regex module from PyPI, which supports fuzzy matching.
Since you actually want to match words with at most X differences (1 deletion OR 1 substitution OR 1 insertion), you may create a Python script like
#!/usr/bin/env python3
import sys

import regex  # third-party module from PyPI: pip install regex


def main(argv):
    if len(argv) < 3:
        print("USAGE: fuzzy_search.py searchword xdiff file")
        sys.exit(1)
    search = argv[0]
    xdiff = int(argv[1])  # maximum number of edits allowed per edit type
    file = argv[2]
    # print("Searching for {} in {} with {} differences...".format(search, file, xdiff))
    with open(file, "r") as f:
        contents = f.read()
    # Whole words only (\b), with up to xdiff substitutions, insertions and deletions.
    print(regex.findall(r"\b(?:{0}){{s<={1},i<={1},d<={1}}}\b".format(regex.escape(search), xdiff), contents))


if __name__ == "__main__":
    main(sys.argv[1:])
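Here, {s<=1,i<=1,d<=1} means the word we search for may contain 1 or 0 substitutions (s<=1), 1 or 0 insertions (i<=1), and 1 or 0 deletions (d<=1). Each limit applies to its own edit type, so a single match may combine one edit of each kind; adding e<=1 inside the braces caps the total number of errors at 1.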
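To make the format call easier to read, here is a minimal sketch of the pattern string it builds. The helper name build_pattern and its cap_total flag are hypothetical, and the standard-library re.escape stands in for regex.escape so the snippet runs without the third-party module. The optional e<=N term is, as far as I know, the regex module's limit on the total number of errors.

```python
import re


def build_pattern(search, xdiff, cap_total=False):
    """Return the fuzzy pattern string for the PyPI regex module (hypothetical helper)."""
    # One limit per edit type: substitutions, insertions, deletions.
    limits = "s<={0},i<={0},d<={0}".format(xdiff)
    if cap_total:
        # Assumption: e<=N additionally limits the total number of errors.
        limits += ",e<={0}".format(xdiff)
    # Doubled braces {{ }} become literal braces after str.format.
    return r"\b(?:{0}){{{1}}}\b".format(re.escape(search), limits)


print(build_pattern("ban", 1))
print(build_pattern("ban", 1, cap_total=True))
```

The first call prints \b(?:ban){s<=1,i<=1,d<=1}\b, and the second prints the same pattern with ,e<=1 added inside the braces.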
The \b are word boundaries; thanks to that construct, only whole words are matched (cat in vacation will not get matched).
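The effect of the word boundaries can be checked on an exact (non-fuzzy) pattern. This is a small illustration only: it uses the standard-library re module rather than regex, and the sample text is made up.

```python
import re

# Made-up sample text: two standalone "cat" words and two longer words containing it.
text = "cat vacation concat cat."

# Without boundaries, "cat" is also found inside longer words.
print(len(re.findall(r"cat", text)))   # 4

# With \b on both sides, only the standalone words are found.
print(re.findall(r"\bcat\b", text))    # ['cat', 'cat']
```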
Save as fuzzy_search.py.
Then, you may call it as
python3 fuzzy_search.py "ban" 1 file
where "ban" is the word the fuzzy search is performed for and 1 is the upper limit of differences.
The result I get is
['ban', '1ban']
You may change the output format to print one match per line:
print("\n".join(regex.findall(r"\b(?:{0}){{s<={1},i<={1},d<={1}}}\b".format(regex.escape(search), xdiff), contents)))
Then, the result is
ban
1ban
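If installing a third-party module is not an option, the strict reading of the question (at most X edits, all of the same kind) can be approximated with the standard library alone. The following is only a sketch under that assumption, not a replacement for the regex approach; the function names within_x_edits and is_subsequence are hypothetical.

```python
def is_subsequence(short, long):
    """True if all characters of `short` appear in `long` in the same order."""
    it = iter(long)
    return all(ch in it for ch in short)


def within_x_edits(candidate, search, x):
    """True if candidate differs from search by at most x edits of a single kind."""
    diff = len(candidate) - len(search)
    if diff == 0:
        # Substitutions only: count the positions that differ.
        return sum(a != b for a, b in zip(candidate, search)) <= x
    if 0 < diff <= x:
        # Insertions only: search must survive intact inside candidate.
        return is_subsequence(search, candidate)
    if 0 < -diff <= x:
        # Deletions only: candidate must be search with letters removed.
        return is_subsequence(candidate, search)
    return False


words = ["ban", "1ban", "12ban", "12ban3"]
print([w for w in words if within_x_edits(w, "ban", 1)])  # ['ban', '1ban']
print([w for w in words if within_x_edits(w, "ban", 2)])  # ['ban', '1ban', '12ban']
```

To run it against list.txt, the words list could be read with open("list.txt").read().split().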