I first convert a PDF into plain text (printing it out works fine), but then I get a UnicodeDecodeError when I run word_tokenize() from NLTK on it.
I get the error even though I call decode('utf-8').encode('utf-8') on the plain text beforehand. In the traceback I noticed that the line inside word_tokenize() that raises the error is plaintext.split('\n'). So I tried to reproduce it by running split('\n') on the plain text myself, but that doesn't raise any error.
So I understand neither what is causing the error nor how to avoid it.
Any help would be greatly appreciated! :) Maybe I could avoid it by changing something in the convert_pdf_to_txt method?
Here's the code to tokenize:
from cStringIO import StringIO
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import os
import string
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
stopset = stopwords.words('english')
path = 'my_folder'
listing = os.listdir(path)
for infile in listing:
    text = self.convert_pdf_to_txt(path+infile)
    text = text.decode('utf-8').encode('utf-8').lower()
    print text
    splitted = text.split('\n')
    filtered_tokens = [i for i in word_tokenize(text) if i not in stopset and i not in string.punctuation]
Here's the method I call in order to convert from pdf to txt:
def convert_pdf_to_txt(self, path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    ret = retstr.getvalue()
    retstr.close()
    return ret
Here's the traceback of the error I get:
Traceback (most recent call last):
  File "/home/iammyr/opt/workspace/task-logger/task_logger/nlp/pre_processing.py", line 65, in <module>
    obj.tokenizeStopWords()
  File "/home/iammyr/opt/workspace/task-logger/task_logger/nlp/pre_processing.py", line 29, in tokenizeStopWords
    filtered_tokens = [i for i in word_tokenize(text) if i not in stopset and i not in string.punctuation]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 93, in word_tokenize
    return [token for sent in sent_tokenize(text)
  [...]
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 9: ordinal not in range(128)
Thanks a million and loads of good karma to you! ;)
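After decode('utf-8') you have a perfectly good unicode string, and .encode('utf-8') turns it straight back into a plain byte string. When NLTK later combines those bytes with unicode strings, Python 2 implicitly tries to decode them with the ASCII codec, which fails on the first non-ASCII byte (0xc2 here). Remove the .encode('utf-8') and you should be fine.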
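To make that concrete, here is a minimal sketch of the "decode once, then stay in unicode" approach. The sample bytes and the to_text helper are hypothetical stand-ins for whatever convert_pdf_to_txt() returns; the sample puts a UTF-8 encoded non-breaking space (bytes 0xc2 0xa0) at position 9 to mimic the error in the traceback. It is written to run on both Python 2 and Python 3. The explicit decode('ascii') at the end only imitates what Python 2 does implicitly when a byte string meets a unicode string. That is probably also why your own split('\n') did not fail: in your script '\n' is a byte string, while inside NLTK it is most likely a unicode literal.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical sample standing in for the bytes returned by convert_pdf_to_txt().
# U+00A0 (non-breaking space) is the two bytes 0xC2 0xA0 in UTF-8.
raw = b'some text\xc2\xa0from a pdf\nSecond line'


def to_text(data, encoding='utf-8'):
    """Decode bytes once at the boundary; pass text through unchanged."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


# Decode, but do NOT encode again: everything downstream sees unicode text.
text = to_text(raw).lower()
lines = text.split(u'\n')
print(lines)

# Imitate Python 2's implicit conversion of a byte string to unicode:
# it uses the ASCII codec, which cannot handle byte 0xC2.
failure_position = None
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    failure_position = exc.start
    print('ASCII decode failed at byte offset', failure_position)
```

With text kept as unicode, word_tokenize(text) should no longer hit the implicit ASCII decode. Encode back to UTF-8 only at the point where you write the result to a file or a socket.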