Tags: python, nltk, python-newspaper

Remove special quotation marks and other characters


I am trying to download articles using Article from newspaper, and to tokenize the words using nltk's word_tokenize. The problem is that when I print the parsed article text, some articles contain special quotation marks like ‘, ’, “ and ”, which the tokenizer does not filter out the way it does a regular ' and ".

Is there a way to replace these special quotes with normal quotes, or better yet, to remove all the special characters that the tokenizer might miss?

I tried to remove these special characters by listing them explicitly in the code, but that gives me the error Non-UTF-8 code starting with '\x92'.


Solution

  • Using the unidecode package replaces these characters with their closest plain-ASCII equivalents.

    from unidecode import unidecode

    # Transliterate every non-ASCII character to its closest ASCII
    # equivalent, e.g. ’ becomes ' and “ becomes ".
    text = unidecode(text)
    

    A drawback, however, is that this also changes characters you may want to keep (e.g. accented letters such as é, which becomes e). If that is a problem, an option is to use regular expressions to specifically erase (or replace) a pre-identified set of special characters:

    import re

    # Characters to replace; note that '\u2019' is the decoded form of the
    # '\x92' byte mentioned in the error message.
    exotic_quotes = ['\u2018', '\u2019'] # fill this up
    # re.sub expects a pattern string, not a list, so join the characters
    # into a single alternation pattern.
    pattern = '|'.join(map(re.escape, exotic_quotes))
    text = re.sub(pattern, "'", text) # change the second argument to the kind of quote you want to replace the exotic ones with
    
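    For context on the Non-UTF-8 code starting with '\x92' error: '\x92' is the Windows-1252 byte for the right single quotation mark (U+2019). Pasting that raw byte into a script makes Python reject the file, so write the character as a \u escape instead; and if the article text arrives as raw bytes, decoding with the right codec normalizes it first. A minimal sketch, assuming the bytes really are Windows-1252:

```python
# '\x92' is the Windows-1252 (cp1252) byte for the right single quote.
raw = b"It\x92s a quote"

# Decode with the correct codec so the byte becomes a proper Unicode character.
text = raw.decode("cp1252")  # now contains U+2019 (’)

# An ordinary replacement then works; the \u escape keeps the
# source file itself ASCII-only.
clean = text.replace("\u2019", "'")
```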

    I hope this helps!
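  • A third option, if neither the extra dependency nor regular expressions appeal, is a translation table built with the standard library's str.maketrans. The mapping below is an illustrative subset, not an exhaustive list of every character the tokenizer might miss:

```python
# Map common "smart" punctuation to plain ASCII (illustrative subset —
# extend this table with whatever characters show up in your articles).
SMART_TO_ASCII = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})

def normalize_quotes(text):
    """Replace smart punctuation with ASCII equivalents."""
    return text.translate(SMART_TO_ASCII)
```

    Unlike unidecode, this touches only the characters you list, so accented letters survive unchanged.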