python parsing data-conversion error-correction

Python: search for misspelled words and replace them with the closest word from a list

Suppose I have a list, list1=['car', 'bike', 'van', 'class'], and I am parsing a text file. The text file can also contain arbitrary words that are neither in this list nor misspellings of words in it.

If the file contains 'ca', it is replaced with 'car'.
If the file contains 'bke', it is replaced with 'bike'.
If the file contains 'clss', it is replaced with 'class'.

My algorithm is basically an error correction algorithm. How do I replace the relevant misspelled words with the words in the list?

Any answer to the question will be appreciated!

Solution

Using the Levenshtein (edit) distance, you can find the closest candidate for each word in the list like this:

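# levenshtein(a, b) is assumed to be defined elsewhere and to return the edit distance
tgt_list = 'ca bke clss'.split()
for word in ['car', 'bike', 'van', 'class']:
    # pair every candidate from the text with its distance to the list word
    wdist_exp = ((w, levenshtein(w, word)) for w in tgt_list)
    # keep the candidate with the smallest distance
    closest, dist = min(wdist_exp, key=lambda t: t[1])
    print('{}=>{}   ld={}'.format(closest, word, dist))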

Prints:

ca=>car   ld=1
bke=>bike   ld=1
ca=>van   ld=2
clss=>class   ld=1
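
The snippet above calls levenshtein(w, word) without defining it. A minimal sketch of such a function, using the standard two-row dynamic programming approach, could look like the following. This is only one possible implementation and not necessarily the one the answer had in mind; a third-party package could be used instead.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # Keep the shorter string as b so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


for typo, word in [('ca', 'car'), ('bke', 'bike'), ('ca', 'van'), ('clss', 'class')]:
    print('{}=>{}   ld={}'.format(typo, word, levenshtein(typo, word)))
```

The distances it reports for these four pairs agree with the ld values printed above.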

It is also possible with the third-party regex module (not the standard library's re), which supports fuzzy matching:

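import regex  # third-party module: pip install regex

template = '{}=>{} with {} substitutions, {} insertions, {} deletions'
tgt = 'ca bke clss'
for word in ['car', 'bike', 'van', 'class']:
    # {e<=2} allows up to 2 errors (substitutions, insertions or deletions)
    pat = r'((?:\b{}\b){{e<=2}})'.format(word)
    # BESTMATCH asks for the match with the fewest errors
    m = regex.search(pat, tgt, regex.BESTMATCH)
    if m:
        # fuzzy_counts is a tuple: (substitutions, insertions, deletions)
        print(template.format(m.group(1), word, *m.fuzzy_counts))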

Prints:

ca =>car with 1 substitutions, 0 insertions, 0 deletions
bke=>bike with 0 substitutions, 0 insertions, 1 deletions
ca =>van with 2 substitutions, 0 insertions, 0 deletions
clss=>class with 0 substitutions, 0 insertions, 1 deletions

You might also want to investigate Python's difflib module, which lends itself to a similar approach.
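
As a rough sketch of the difflib route, difflib.get_close_matches can be applied to each word of the text, leaving a word unchanged when nothing in the list is similar enough. The helper names correct_word and correct_text, the sample sentence, and the cutoff of 0.75 are assumptions made for illustration; the cutoff in particular would need tuning for real data.

```python
import difflib

WORDS = ['car', 'bike', 'van', 'class']


def correct_word(word, vocabulary=WORDS, cutoff=0.75):
    """Return the closest vocabulary word, or the word itself if none is close enough."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, vocabulary=WORDS, cutoff=0.75):
    """Correct every whitespace-separated word in text (punctuation is not handled)."""
    return ' '.join(correct_word(w, vocabulary, cutoff) for w in text.split())


print(correct_text('my ca is faster than a bke in clss'))
# my car is faster than a bike in class
```

The cutoff is stricter than difflib's default of 0.6 on purpose: with the default, an ordinary word such as 'and' is similar enough to 'van' to be replaced, which is exactly the kind of false correction the question wants to avoid.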