Search code examples
pythonlanguage-agnosticmachine-learningnlpoov

Efficient way of resolving unknown words to known words?


I am designing a text processing program that will generate a list of keywords from a long itemized text document, and combine entries for words that are similar in meaning. There are metrics out there, however I have a new issue of dealing with words that are not in the dictionary that I am using.

I am currently using nltk and python, but my issues here are of a much more abstracted nature. Given a word that is not in a dictionary, what would be an efficient way of resolving it to a word that is within your dictionary? My only current solution involves running through the words in the dictionary and picking the word with the shortest Levenshtein distance (editing distance) from the inputted word.

Obviously this is a very slow and impractical method, and I don't actually need the absolute best match from within the dictionary, just so long as it is a contained word and it is pretty close. Efficiency is more important for me in the solution, but a basic level of accuracy would also be needed.

Any ideas on how to generally resolve some unknown word to a known one in a dictionary?


Solution

  • Looks like you need a spelling corrector to match words in your dictionary. The code below works and taken directly from this blog http://norvig.com/spell-correct.html written by Peter Norvig,

    import re, collections
    
    def words(text): return re.findall('[a-z]+', text.lower()) 
    
    def train(features):
        model = collections.defaultdict(lambda: 1)
        for f in features:
            model[f] += 1
        return model
    
    NWORDS = train(words(file('big.txt').read()))
    
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    
    def edits1(word):
        splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes    = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
        replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
        inserts    = [a + c + b     for a, b in splits for c in alphabet]
        return set(deletes + transposes + replaces + inserts)
    
    def known_edits2(word):
        return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
    
    def known(words): return set(w for w in words if w in NWORDS)
    
    def correct(word):
        candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
        return max(candidates, key=NWORDS.get)
    

    big.txt is your dictionary containing known words.