Tags: python, performance, nlp, nltk, spacy

Python: Improving performance of code performing spelling correction on text data


I have text data in the form of comments that I want to preprocess. Apart from cutting away noise like URLs, numbers, etc. and performing lemmatization, I also want to perform spelling correction. Specifically, I want to apply spelling correction only to words that do not occur more often than a given number of times, to avoid false positives. For that purpose, I use pyspellchecker for the correction and NLTK's FreqDist to get word frequencies; however, doing that increases the time needed for preprocessing significantly.
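For context, the two library calls involved behave roughly like this in isolation (a minimal illustration only; my actual preprocessing loop is shown further below):

from spellchecker import SpellChecker
from nltk.probability import FreqDist

spell = SpellChecker()
fdist = FreqDist()

fdist.update(["hello", "wrld", "hello"])   # count word occurrences
print(fdist["hello"])                      # -> 2

print(spell.unknown(["hello", "wrld"]))    # words not in the dictionary, e.g. {'wrld'}
print(spell.correction("wrld"))            # most likely correction, e.g. 'world'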

I tried making things as performant as I could, but I am stuck and was wondering if there are still improvements I could make.

Here is my code.

Imports:

import spacy
from spellchecker import SpellChecker
from nltk.probability import FreqDist

nlp = spacy.load("en_core_web_sm")
spell = SpellChecker()
fdist = FreqDist()

Code:

processed_comments = []
dict_misspell = {}

pipe = nlp.pipe(list_of_comments, batch_size=512, disable=["tagger", "parser"])
for j, doc in enumerate(pipe):
    tokens = [token.lemma_.lower() for token in doc if not token.is_punct and not token.is_digit
              and not token.like_url and not token.like_email and not token.like_num]
    processed_comments.append(" ".join(tokens))
    fdist += FreqDist(tokens)

    # remember which comments contain misspellings to avoid having to look at every comment later
    misspelled = spell.unknown(tokens)
    if len(misspelled) > 0:
        for misspelled_word in misspelled:
            if misspelled_word in dict_misspell.keys():
                dict_misspell[misspelled_word].append(j)
            else:
                dict_misspell[misspelled_word] = [j]

# spell correction is done after the loop because only then is the frequency dict fully built
for mis in dict_misspell.keys():
    if fdist[mis] <= 5:  # only fix below a certain word frequency to avoid false positives
        misspelling_idxs = dict_misspell[mis]
        correct_spelling = spell.correction(mis)
        for idx in misspelling_idxs:
            processed_comments[idx] = processed_comments[idx].replace(mis, correct_spelling)

As you can see above, I preprocess each individual comment, add all words of that comment to the frequency dictionary, and for each word that the spellchecker considers misspelled I save the word and the index of the comment in which it occurs in a misspell dictionary. Once the frequency dictionary is fully built, I correct the possibly misspelled words whose frequency meets the condition in the individual comments.

Does anyone see a way to improve performance here?


Solution

  • Spell-checking is rather heavy processing.

    You can try to filter out some tokens in dict_misspell, in order to call correction on fewer words. You can analyse the unknown words of a subset of your comments and create some rules to filter out certain kinds of tokens.

    Example: words with fewer than 2 characters; words containing numbers; emojis; named entities; ...
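A minimal sketch of such a pre-filter, applied to dict_misspell before the correction loop (the length threshold and the digit/ASCII checks are placeholder rules based on the examples above and would need tuning on your data; named-entity filtering would additionally require the spaCy doc and is omitted here):

def worth_correcting(token):
    """Cheap checks applied before the expensive spell.correction() call."""
    if len(token) < 2:                         # very short tokens: skip
        return False
    if any(ch.isdigit() for ch in token):      # tokens containing numbers: skip
        return False
    if not token.isascii():                    # crude emoji / non-Latin filter
        return False
    return True

# keep only the candidates that pass the filters
dict_misspell = {mis: idxs for mis, idxs in dict_misspell.items()
                 if worth_correcting(mis)}

This way the expensive correction call only runs on tokens that actually stand a chance of being real misspellings.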