Tags: python, nlp, nltk

NLP: how to speed up spelling correction on 147k rows of short messages


I'm trying to speed up spell checking on a large dataset with 147k rows. The following function has been running for an entire afternoon and is still not finished. Is there a way to speed it up? The messages have already been case-normalized, stripped of punctuation, and lemmatized, and they are all plain strings.

import autocorrect
from autocorrect import Speller
spell = Speller()

def spell_check(x):
    # correct each word of the message individually, then rejoin into one string
    corrected_words = []
    for word in x.split():
        corrected_words.append(spell(word))
    return ' '.join(corrected_words)

df['clean'] = df['old'].apply(spell_check)

Solution

  • In addition to what @Amadan said (which is definitely true: autocorrect does the correction in a very inefficient way):

    You treat every word in the huge dataset as if it were being looked up for the first time, because you call spell() on each occurrence. In reality, after a while almost every word has already been looked up, so storing these results and reusing them is much more efficient.

    Here is one way to do it:

    import autocorrect
    from autocorrect import Speller
    spell = Speller()
    
    # get all unique words in the data as a set (first split each row into words, then put them all in a flat set)
    unique_words = {word for words in df["old"].apply(str.split) for word in words}
    
    # get the corrected version of each unique word and put this mapping in a dictionary
    corrected_words = {word: spell(word) for word in unique_words}
    
    # write the cleaned row by looking up the corrected version of each unique word
    df['clean'] = [" ".join([corrected_words[word] for word in row.split()]) for row in df["old"]]
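    
    If you prefer not to build the dictionary up front, memoizing the lookup with functools.lru_cache gives the same effect: each distinct word is corrected only once, and every later occurrence hits the cache. This is just a sketch of that alternative, assuming the same df["old"] column of preprocessed strings as above:
    
    from functools import lru_cache
    
    import autocorrect
    from autocorrect import Speller
    spell = Speller()
    
    # memoize the expensive lookup; maxsize=None keeps a cache entry for every distinct word
    @lru_cache(maxsize=None)
    def correct(word):
        return spell(word)
    
    # run each word of a message through the cache and rejoin
    def spell_check_cached(text):
        return " ".join(correct(word) for word in text.split())
    
    df['clean'] = df['old'].apply(spell_check_cached)
    
    Either way, the expensive spell() call runs once per unique word instead of once per word occurrence, which is where the speedup comes from.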