Tags: python, spell-checking, autocorrect

pyspellchecker: do not split URL


I tried to set up autocorrection using pyspellchecker in Python. In general it works, but it currently also splits URLs into pieces and corrects those pieces, which is not what I want. The code is as follows:

from spellchecker import SpellChecker

spell = SpellChecker()
words = spell.split_words("This is my URL https://test.com")
test = [spell.correction(word) for word in words]

This results in the following: ['This', 'is', 'my', 'URL', 'steps', 'test', 'com']

What do I have to change so that URLs are not split and autocorrected?


Solution

  • NLTK's TweetTokenizer correctly tokenizes URLs, hashtags, and emoticons.

    >>> from nltk.tokenize import TweetTokenizer
    >>> s = "This is my URL https://test.com"
    >>> tknzr = TweetTokenizer()
    >>> tknzr.tokenize(s)
    ['This', 'is', 'my', 'URL', 'https://test.com']
    

    NLTK ships with several word tokenizers. I suggest using NLTK to split your string into tokens first, then filtering out the tokens that should not be autocorrected (such as URLs) before passing the rest to pyspellchecker. NLTK's part-of-speech tagging utilities can also help decide which tokens should be autocorrected.
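
The filtering step suggested above could look roughly like the following. This is a minimal sketch, not part of either library: `is_url` and `correct_tokens` are hypothetical helper names, and the URL pattern is a simple assumption (a scheme such as `https://` or a leading `www.`) that may need adjusting for your data. The corrector is passed in as a plain callable, so the helper itself needs only the standard library.

```python
import re

# Assumption: a token is treated as a URL if it starts with a scheme
# ("https://", "ftp://", ...) or with "www.". Adjust to taste.
URL_RE = re.compile(r"^(?:[a-z][a-z0-9+.-]*://|www\.)\S+$", re.IGNORECASE)


def is_url(token):
    """Return True if the token looks like a URL (hypothetical helper)."""
    return bool(URL_RE.match(token))


def correct_tokens(tokens, correct):
    """Apply `correct` to every token except URL-like ones.

    `correct` is any callable that maps a word to its correction,
    for example SpellChecker().correction. If it returns None
    (no suggestion), the original token is kept.
    """
    result = []
    for token in tokens:
        if is_url(token):
            result.append(token)  # leave URLs untouched
        else:
            result.append(correct(token) or token)
    return result


# Possible wiring with the libraries from the question and the answer
# (left commented out so the sketch runs without them installed):
#
# from nltk.tokenize import TweetTokenizer
# from spellchecker import SpellChecker
#
# tokens = TweetTokenizer().tokenize("This is my URL https://test.com")
# corrected = correct_tokens(tokens, SpellChecker().correction)
```

Passing the corrector in as an argument keeps the URL check separate from the spell checker, so the same helper works with any tokenizer that keeps URLs as single tokens.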