Consider this: I have a list, list1=['car', 'bike', 'van', 'class'], and I am parsing a text file. This text file can contain other arbitrary words that are not in this list and are not misspellings of words in this list.
If the file contains 'ca', it should be replaced with 'car'.
If the file contains 'bke', it should be replaced with 'bike'.
If the file contains 'clss', it should be replaced with 'class'.
My algorithm is basically an error correction algorithm. How do I replace the relevant misspelled words with the words in the list?
Any answer to the question will be appreciated!
Using a Levenshtein distance function, you can do this:
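    # levenshtein(a, b) is assumed to be defined elsewhere and to return
    # the edit distance between the two strings.
    tgt_list = 'ca bke clss'.split()
    for word in ['car', 'bike', 'van', 'class']:
        # Pair each candidate with its distance to the current list word.
        wdist_exp = ((w, levenshtein(w, word)) for w in tgt_list)
        closest, dist = min(wdist_exp, key=lambda t: t[1])
        print('{}=>{} ld={}'.format(closest, word, dist))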
Prints:
    ca=>car ld=1
    bke=>bike ld=1
    ca=>van ld=2
    clss=>class ld=1
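The snippet above calls a levenshtein function without showing it, and it searches from each list word toward the file words, which is why 'ca' is also reported for 'van'. As a rough sketch only, here is one way to define the distance and run the lookup in the other direction, from each file word to its closest list word. The names correct_text and max_dist are made up for this illustration, and the threshold of 1 is an assumption you would need to tune.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b (dynamic programming)."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def correct_text(text, vocabulary, max_dist=1):
    """Replace each whitespace-separated token with its closest vocabulary
    word, but only if that word is within max_dist edits (hypothetical helper)."""
    corrected = []
    for token in text.split():
        best = min(vocabulary, key=lambda w: levenshtein(token, w))
        if levenshtein(token, best) <= max_dist:
            corrected.append(best)
        else:
            corrected.append(token)  # leave unrelated words untouched
    return ' '.join(corrected)


list1 = ['car', 'bike', 'van', 'class']
print(correct_text('ca bke clss house', list1))
```

The max_dist cutoff is what keeps arbitrary words from being rewritten. With a small list and short words, a cutoff of 1 can still produce false corrections (for example 'can' is one edit away from both 'car' and 'van'), so treat it as a starting point.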
It is also possible with the regex module:
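    import regex  # third-party module (pip install regex), not the built-in re

    template = '{}=>{} with {} substitutions, {} insertions, {} deletions'
    tgt = 'ca bke clss'
    for word in ['car', 'bike', 'van', 'class']:
        # {e<=2} allows up to two errors (substitutions, insertions, deletions).
        pat = r'((?:\b{}\b){{e<=2}})'.format(word)
        m = regex.search(pat, tgt, regex.BESTMATCH)
        if m:
            print(template.format(m.group(1), word, *m.fuzzy_counts))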
Prints:
    ca =>car with 1 substitutions, 0 insertions, 0 deletions
    bke=>bike with 0 substitutions, 0 insertions, 1 deletions
    ca =>van with 2 substitutions, 0 insertions, 0 deletions
    clss=>class with 0 substitutions, 0 insertions, 1 deletions
You might also want to look at Python's difflib module, which supports a similar approach.
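For what it's worth, a minimal sketch of the difflib route might look like the following. It uses difflib.get_close_matches from the standard library; the helper name correct_with_difflib and the cutoff of 0.6 (the library default) are assumptions for illustration, not something from the question.

```python
import difflib


def correct_with_difflib(text, vocabulary, cutoff=0.6):
    """Replace each token with its best vocabulary match, if any match
    scores at least `cutoff` on difflib's similarity ratio (hypothetical helper)."""
    corrected = []
    for token in text.split():
        # n=1 asks for only the single best match; the result may be empty.
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return ' '.join(corrected)


list1 = ['car', 'bike', 'van', 'class']
print(correct_with_difflib('ca bke clss house', list1))
```

Words with no sufficiently similar entry in the list are left as they are. Raising cutoff makes the correction more conservative, and lowering it catches more misspellings at the risk of rewriting unrelated words.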