I have created a custom corpus using nltk's CategorizedPlaintextCorpusReader.
There are non-ASCII characters within the .txt files of my corpus which I'm unable to decode. I assume it's because this is a "plaintext" reader, but I need to decode them nonetheless.
Code:
import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader
import os
mr = CategorizedPlaintextCorpusReader(r'C:\mycorpus', r'(?!\.).*\.txt',
                                      cat_pattern=os.path.join(r'(neg|pos)', '.*'))
for w in mr.words():
    print(w)
This prints the tokenized words of the files that contain only ASCII, and then throws the following error:
for w in mr.words():
File "C:\Python\Python36-32\lib\site-packages\nltk\corpus\reader\util.py", line 402, in iterate_from
for tok in piece.iterate_from(max(0, start_tok-offset)):
File "C:\Python\Python36-32\lib\site-packages\nltk\corpus\reader\util.py", line 296, in iterate_from
tokens = self.read_block(self._stream)
File "C:\Python\Python36-32\lib\site-packages\nltk\corpus\reader\plaintext.py", line 122, in _read_word_block
words.extend(self._word_tokenizer.tokenize(stream.readline()))
File "C:\Python\Python36-32\lib\site-packages\nltk\data.py", line 1168, in readline
new_chars = self._read(readsize)
File "C:\Python\Python36-32\lib\site-packages\nltk\data.py", line 1400, in _read
chars, bytes_decoded = self._incr_decode(bytes)
File "C:\Python\Python36-32\lib\site-packages\nltk\data.py", line 1431, in _incr_decode
return self.decode(bytes, 'strict')
File "C:\Python\Python36-32\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 30: invalid start byte
I have attempted to decode with:
mr.decode('unicode-escape')
which throws this error:
AttributeError: 'CategorizedPlaintextCorpusReader' object has no attribute 'decode'
I am using Python 3.6.4.
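To narrow the problem down before touching the reader, you could first find out which files fail to decode. This is a small stdlib-only sketch (the directory path and function name are mine, not from NLTK):

```python
import os

def find_undecodable_files(root, encoding="utf-8"):
    """Return (path, error) pairs for .txt files under root that fail to decode."""
    failures = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".txt"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                data = f.read()
            try:
                data.decode(encoding)
            except UnicodeDecodeError as err:
                failures.append((path, err))
    return failures

# Usage (adjust the path to your corpus root):
# for path, err in find_undecodable_files(r'C:\mycorpus'):
#     print(path, err)
```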
The problem is that NLTK's corpus reader assumes that your plain-text files were saved with UTF-8 encoding.
However, this assumption is apparently wrong, as the files were encoded with another codec.
My guess is that CP1252 (aka "Windows Latin-1") was used, because it's quite popular and it fits your description well: in that encoding, the en dash "–" is encoded as the byte 0x96, which is exactly the byte mentioned in the error message.
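You can verify this byte-level claim directly in an interpreter:

```python
# In CP1252 the byte 0x96 is the en dash (U+2013); as a UTF-8 start
# byte it is invalid, which is exactly what the traceback reports.
assert b"\x96".decode("cp1252") == "\u2013"   # en dash

try:
    b"\x96".decode("utf-8")
except UnicodeDecodeError as err:
    print(err)   # 'utf-8' codec can't decode byte 0x96 ...
```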
You can specify the encoding of the input files in the constructor of the corpus reader:
mr = CategorizedPlaintextCorpusReader(
    r'C:\mycorpus',
    r'(?!\.).*\.txt',
    cat_pattern=os.path.join(r'(neg|pos)', '.*'),
    encoding='cp1252')
Try this, and check whether the non-ASCII characters (en dash, bullet) come out correctly in the output (and are not replaced with mojibake).
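Alternatively, if you'd rather fix the files once instead of passing `encoding` to every reader, you could transcode the corpus to UTF-8. A sketch, assuming all files really are CP1252 (back up the corpus first; the helper name is mine):

```python
import os

def transcode_to_utf8(path, src_encoding="cp1252"):
    """Re-save a text file as UTF-8, assuming it currently uses src_encoding."""
    with open(path, "r", encoding=src_encoding) as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

# Usage (adjust the path to your corpus root):
# for dirpath, _, names in os.walk(r'C:\mycorpus'):
#     for name in names:
#         if name.endswith('.txt'):
#             transcode_to_utf8(os.path.join(dirpath, name))
```

After that, the default UTF-8 assumption of the corpus reader holds and no `encoding` argument is needed.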