Tags: python, parsing, data-conversion, error-correction

Python: search for misspelled words and replace them with words from a list


I have a list, list1 = ['car', 'bike', 'van', 'class'], and I am parsing a text file. The file may also contain arbitrary words that are neither in this list nor misspellings of words in it.

  • If the file contains 'ca', it should be replaced with 'car'

  • If the file contains 'bke', it should be replaced with 'bike'

  • If the file contains 'clss', it should be replaced with 'class'

What I need is essentially an error-correction algorithm. How do I replace the misspelled words with the matching words from the list?

Any answer to the question will be appreciated!


Solution

  • Using a Levenshtein (edit distance) function, you can do this:

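    # Assumes levenshtein(a, b) is defined elsewhere and returns the edit distance.
    tgt_list = 'ca bke clss'.split()
    for word in ['car', 'bike', 'van', 'class']:
        # Pair each candidate from the text with its distance to the list word.
        wdist_exp = ((w, levenshtein(w, word)) for w in tgt_list)
        closest, dist = min(wdist_exp, key=lambda t: t[1])
        print('{}=>{}   ld={}'.format(closest, word, dist))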
    

    Prints:

    ca=>car   ld=1
    bke=>bike   ld=1
    ca=>van   ld=2
    clss=>class   ld=1
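
    The snippet above leaves levenshtein undefined and loops over the list words, so 'van' still gets paired with 'ca'. A minimal sketch of one way to fill both gaps follows. The levenshtein implementation, the correct helper, the sample text and the max_dist=1 threshold are assumptions for illustration, not part of the original answer. It loops over the words from the file instead and only replaces a word when a list word is close enough:

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b (row-by-row DP)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            replace_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]


def correct(word, vocabulary, max_dist=1):
    """Return the closest vocabulary word within max_dist edits, else the word unchanged."""
    best = min(vocabulary, key=lambda v: levenshtein(word, v))
    return best if levenshtein(word, best) <= max_dist else word


list1 = ['car', 'bike', 'van', 'class']
text = 'ca bke clss truck'  # 'truck' stands in for an unrelated word
corrected = ' '.join(correct(w, list1) for w in text.split())
print(corrected)  # car bike class truck
```

    With a larger max_dist, ties become more likely ('ca' is two edits from 'van' as well as one from 'car'), so the threshold is worth tuning against real data.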
    

    It is also possible with the third-party regex module, which supports fuzzy matching:

    import regex  # third-party module (not the standard-library re)

    template = '{}=>{} with {} substitutions, {} insertions, {} deletions'
    tgt = 'ca bke clss'
    for word in ['car', 'bike', 'van', 'class']:
        # {e<=2} allows up to two errors; the doubled braces escape str.format.
        pat = r'((?:\b{}\b){{e<=2}})'.format(word)
        m = regex.search(pat, tgt, regex.BESTMATCH)
        if m:
            print(template.format(m.group(1), word, *m.fuzzy_counts))
    

    Prints:

    ca =>car with 1 substitutions, 0 insertions, 0 deletions
    bke=>bike with 0 substitutions, 0 insertions, 1 deletions
    ca =>van with 2 substitutions, 0 insertions, 0 deletions
    clss=>class with 0 substitutions, 0 insertions, 1 deletions
    

    You might also want to look at Python's standard-library difflib module, which can be used in a similar way.
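
    As a rough sketch of the difflib route, difflib.get_close_matches can pick the best candidate from the list for each word read from the file. The correct helper and the sample text are illustrative assumptions, not part of the original answer. The cutoff of 0.6 is the function's default similarity threshold:

```python
import difflib

list1 = ['car', 'bike', 'van', 'class']


def correct(word, vocabulary, cutoff=0.6):
    """Return the most similar vocabulary word, or the word unchanged if none is close enough."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


text = 'ca bke clss truck'  # 'truck' stands in for an unrelated word
corrected = ' '.join(correct(w, list1) for w in text.split())
print(corrected)  # car bike class truck
```

    Raising the cutoff makes the correction stricter. Lowering it catches more misspellings but also risks rewriting unrelated words.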