python nlp

Python: how to automatically spellcheck and correct joined words such as "reportthatexplains" and "havebeen"


I have some large text files that were extracted from PDFs, so the English itself is correct. However, many words in these files are joined together: "informationotherwise", "havebeen", "reportthatexplains". Spell checkers such as LanguageTool, Sublime, and MS Word all flag these as errors, but I have not managed to detect and correct them in Python.

I tried pyspellchecker and TextBlob to check and correct these words, but, alas, to no avail.

For example, this pyspellchecker code returns None for all three words:

from spellchecker import SpellChecker

spell = SpellChecker()
misspelled = spell.unknown(["informationotherwise", "havebeen", "reportthatexplains"])

for word in misspelled:
    print(spell.correction(word))
    print(spell.candidates(word))

And this TextBlob code leaves the word unchanged:

from textblob import TextBlob

t = "havebeen"
TextBlob(t).correct().string

>>> 'havebeen'

Any suggestions?


Solution

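  • Use the wordninja library to split each joined word into its component words. Spell checkers look for a single nearby dictionary word, so they cannot repair a missing space; a word splitter can.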

    import wordninja

    words = ["informationotherwise", "havebeen", "reportthatexplains"]
    for word in words:
        # wordninja.split returns a list of the component words
        print(' '.join(wordninja.split(word)))

    # Output:
    # information otherwise
    # have been
    # report that explains
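
To illustrate the idea behind this kind of splitting, and how it could be applied to running text rather than to single words, here is a minimal standard-library sketch. It is not wordninja's implementation: wordninja ranks candidate splits by word frequency, whereas this sketch simply picks the split with the fewest dictionary words. The names `VOCAB`, `split_joined`, and `repair_text` are hypothetical, and the tiny vocabulary is an assumption for illustration only.

```python
import re

# Hypothetical toy vocabulary (an assumption); a real one would hold
# many thousands of words.
VOCAB = {"information", "otherwise", "have", "been", "report",
         "that", "explains", "the", "in", "a"}


def split_joined(token, vocab=VOCAB):
    """Split token into the fewest vocabulary words; return [token] if no split exists."""
    n = len(token)
    # best[i] is the shortest list of vocabulary words covering token[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n] if best[n] is not None else [token]


def repair_text(text, vocab=VOCAB):
    """Split every alphabetic run that is not itself a vocabulary word."""
    def fix(match):
        word = match.group(0)
        if word.lower() in vocab:
            return word
        pieces = split_joined(word.lower(), vocab)
        # Leave the word untouched when no split was found.
        # Note: a successful split is emitted in lowercase.
        return word if len(pieces) == 1 else " ".join(pieces)
    return re.sub(r"[A-Za-z]+", fix, text)


print(repair_text("the reportthatexplains informationotherwise."))
```

In practice, the same `repair_text` structure could call `wordninja.split` in place of `split_joined`, so that only words missing from the dictionary are split and correctly spelled words are left alone.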