Efficient autocorrect on entire text files with Python?


I am currently preprocessing several hundred thousand sentences. To improve our ML predictions we should probably run some sort of autocorrect/spellchecking on the data. However, most Python implementations I have found so far are slow. Is there an efficient and easy way to autocorrect an entire text file in Python?

I tried working with https://github.com/phatpiglet/autocorrect/ but it takes relatively long (I did not implement it well, but I guess someone has already solved this somewhere).


Solution

  • As @Vishnudev mentioned, prefer using SymSpellCompound

    According to benchmarks it's faster than other spelling correction implementations by orders of magnitude. Please refer to this graph
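
    As a minimal sketch of what this could look like, the code below corrects a file line by line with SymSpell's compound lookup. It assumes the symspellpy port is installed (pip install symspellpy) and uses the English frequency dictionary that ships with it. The helper names build_symspell and correct_file, and any file paths you pass in, are hypothetical and not part of the library.

```python
def build_symspell(max_edit_distance=2, prefix_length=7):
    """Create a SymSpell instance loaded with symspellpy's bundled English dictionary."""
    # Imported lazily so that correct_file below can be used (or tested)
    # with any object that offers a compatible lookup_compound method.
    from importlib.resources import files
    from symspellpy import SymSpell

    sym_spell = SymSpell(max_dictionary_edit_distance=max_edit_distance,
                         prefix_length=prefix_length)
    dictionary_path = files("symspellpy") / "frequency_dictionary_en_82_765.txt"
    # Each dictionary line is "<term> <count>": term in column 0, count in column 1.
    sym_spell.load_dictionary(str(dictionary_path), term_index=0, count_index=1)
    return sym_spell


def correct_file(in_path, out_path, sym_spell, max_edit_distance=2):
    """Write a spell-corrected copy of in_path to out_path, one line at a time."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            text = line.strip()
            if not text:
                dst.write("\n")  # keep blank lines so line numbers stay aligned
                continue
            suggestions = sym_spell.lookup_compound(
                text, max_edit_distance=max_edit_distance)
            # Fall back to the original text if no suggestion comes back.
            dst.write((suggestions[0].term if suggestions else text) + "\n")
```

    Typical use would be correct_file("sentences.txt", "sentences.corrected.txt", build_symspell()). Build the SymSpell instance once and reuse it, since loading the dictionary is the expensive part. As far as I can tell, the compound lookup returns normalized text (lowercased, punctuation dropped) by default, so check whether that suits your pipeline.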

    If you read the code behind autocorrect, it mentions that it's based on Peter Norvig's implementation available here

    I also tried benchmarking spacy_hunspell, but could not improve timings by more than 15-20%.
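
    To illustrate where the time goes in a Norvig-style corrector such as the one autocorrect is based on, here is a condensed sketch of that approach using only the standard library. The tiny word list is a toy assumption for demonstration; a real setup would count words from a large corpus. This is not the autocorrect package's actual code.

```python
import re
from collections import Counter

# Toy word frequencies (assumption): a real setup would count a large corpus.
WORDS = Counter(re.findall(
    r"[a-z]+", "the quick brown fox jumps over the lazy dog the end"))
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right
                for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings two edits away from word (this set grows very quickly)."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def known(words):
    """Keep only the candidates that appear in the word list."""
    return {w for w in words if w in WORDS}


def correction(word):
    """Pick the most frequent known candidate, preferring fewer edits."""
    candidates = (known([word]) or known(edits1(word))
                  or known(edits2(word)) or {word})
    return max(candidates, key=lambda w: WORDS[w])
```

    For a word of length n, edits1 generates 54n+25 strings before deduplication, and edits2 applies edits1 again to each of those. That per-word candidate generation at lookup time is the cost SymSpell is designed to avoid.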

    Other avenues for improvement:

    • Make use of Python's multiprocessing module.
    • If you're using pandas, consider the Dask framework for parallel processing.
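
    A minimal sketch of the multiprocessing idea, assuming the sentences fit in memory as a list of strings. correct_line is a hypothetical placeholder that you would replace with a call into your actual corrector; the workers and chunksize defaults are arbitrary starting points, not tuned values.

```python
from multiprocessing import Pool


def correct_line(line):
    """Hypothetical placeholder: swap in a call to a real spell corrector."""
    return line.replace("teh", "the")


def correct_lines(lines, workers=4, chunksize=1000):
    """Correct lines in parallel and return them in their original order."""
    if workers <= 1:
        # Serial fallback: handy for debugging and for small inputs.
        return [correct_line(line) for line in lines]
    with Pool(processes=workers) as pool:
        # map() preserves input order; chunksize sends lines to workers in
        # batches, which reduces inter-process communication overhead.
        return pool.map(correct_line, lines, chunksize=chunksize)
```

    On platforms that start worker processes with spawn (Windows, and macOS by default), call correct_lines from under an if __name__ == "__main__": guard. If the corrector needs a large dictionary, load it once per worker rather than once per line.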

    Good luck with your task!