Search code examples
pythonnlpnltkspell-checking

how to modify peter norvig spell checker to get more number of suggestions per word


I tried the peter norvig code for spellchecker on http://norvig.com/spell-correct.html but how do i modify it to get more number of suggestions instead of just 1 correct spelling

import re
from collections import Counter

def words(text):
     return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or 
           [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in 
                  letters]
    inserts    = [L + c + R               for L, R in splits for c in 
    letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))import re

Solution

  • You can use the candidates function.

    It gives you

    • the original word if it is correct already
    • otherwise, all known words with an edit distance of 1 to the original word
    • if there is no candidate with distance 1, then all candidates with distance 2
    • if there was nothing in the previous case, then the original word

    If candidates are found in case 2 or 3, then the returned set may contain more than one suggestion.

    If the original word is returned, however, you don't know if it's the case because it's correct (case 1), or because there are no close candidates (case 4).

    However,

    this approach (the way how edits1() is implemented) is brute force, and it's really inefficient for long words, and it gets worse if you add more characters (eg. for supporting other languages). Consider something like simstring for efficiently retrieving words with similar spelling in a large collection.