
NLTK - Decoding Unicode in custom corpus


I have created a custom corpus using nltk's CategorizedPlaintextCorpusReader.

There are non-ASCII characters within the .txt files of my corpus which I'm unable to decode. I assume it's because this is a "plaintext" reader, but I need to decode them nonetheless.

Code:

import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader
import os

mr = CategorizedPlaintextCorpusReader(r'C:\mycorpus', r'(?!\.).*\.txt',
        cat_pattern=os.path.join(r'(neg|pos)', '.*'))

for w in mr.words():
    print(w)

This prints the tokenized words of the files that contain only ASCII, and then throws the following error:

for w in mr.words():
  File "C:\Python\Python36-32\lib\site-packages\nltk\corpus\reader\util.py", line 402, in iterate_from
    for tok in piece.iterate_from(max(0, start_tok-offset)):
  File "C:\Python\Python36-32\lib\site-packages\nltk\corpus\reader\util.py", line 296, in iterate_from
    tokens = self.read_block(self._stream)
  File "C:\Python\Python36-32\lib\site-packages\nltk\corpus\reader\plaintext.py", line 122, in _read_word_block
    words.extend(self._word_tokenizer.tokenize(stream.readline()))
  File "C:\Python\Python36-32\lib\site-packages\nltk\data.py", line 1168, in readline
    new_chars = self._read(readsize)
  File "C:\Python\Python36-32\lib\site-packages\nltk\data.py", line 1400, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "C:\Python\Python36-32\lib\site-packages\nltk\data.py", line 1431, in _incr_decode
    return self.decode(bytes, 'strict')
  File "C:\Python\Python36-32\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 30: invalid start byte

I attempted to decode with:

mr.decode('unicode-escape') 

which throws this error:

AttributeError: 'CategorizedPlaintextCorpusReader' object has no attribute 'decode'

I am using Python 3.6.4.


Solution

  • The problem is that NLTK's corpus reader assumes your plain-text files were saved with UTF-8 encoding. This assumption is apparently wrong: the files were encoded with another codec. My guess is CP1252 (aka "Windows Latin-1"), because it's quite popular and it fits your case well: in that encoding, the en dash "–" is encoded as the byte 0x96, which is the byte mentioned in the error message.

    You can specify the encoding of the input files in the constructor of the corpus reader:

    mr = CategorizedPlaintextCorpusReader(
        r'C:\mycorpus',
        r'(?!\.).*\.txt',
        cat_pattern=os.path.join(r'(neg|pos)', '.*'),
        encoding='cp1252')
    

    Try this, and check whether the non-ASCII characters (en dash, bullet) come out correct in the output (and not replaced with mojibake).
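You can confirm the diagnosis without NLTK at all: 0x96 is not a valid first byte of a UTF-8 sequence, but under CP1252 it decodes to an en dash. A minimal check:

```python
# Byte 0x96 cannot start a UTF-8 sequence, but in CP1252 it is an en dash.
raw = b"pros \x96 cons"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # 'utf-8' codec can't decode byte 0x96 in position 5: invalid start byte

print(raw.decode("cp1252"))  # pros – cons
```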
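Alternatively, if you'd rather fix the files than the reader, you can re-encode the corpus to UTF-8 once and keep the reader's default encoding. A sketch, assuming every .txt file really is CP1252 (`convert_to_utf8` is just an illustrative helper name):

```python
import os

def convert_to_utf8(corpus_root):
    """Rewrite every .txt file under corpus_root from CP1252 to UTF-8, in place."""
    for dirpath, _, filenames in os.walk(corpus_root):
        for name in filenames:
            if name.endswith(".txt"):
                path = os.path.join(dirpath, name)
                with open(path, encoding="cp1252") as f:
                    text = f.read()
                with open(path, "w", encoding="utf-8") as f:
                    f.write(text)
```

Back up the corpus first: if any file is already UTF-8, re-reading it as CP1252 would silently garble its non-ASCII characters.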