I'm aware that unicode was replaced by str in Python 3, but I keep getting the same error no matter how I write this code. Can anyone tell me why?
I'm using boilerpipe for a specific set of web crawls:
    for urls in allUrls:
        # Write each article's text and publish date to its own numbered file.
        fileW = open('article('+ str(counter)+')', 'w')
        articleDate = Article(urls)
        articleDate.download()
        articleDate.parse()
        print(articleDate.publish_date)
        fileW.write(str(Extractor(extractor='ArticleExtractor', url=urls).getText() + "\n\n\n" + str(articleDate.publish_date)+"\n\n\n"))
        fileW.close()  # parentheses are required to actually close the file
        counter += 1
Error:
    Traceback (most recent call last):
      File "/Users/Adrian/anaconda3/lib/python3.6/site-packages/boilerpipe/extract/__init__.py", line 45, in __init__
        self.data = unicode(self.data, encoding)
    NameError: name 'unicode' is not defined

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "webcrawl.py", line 26, in <module>
        fileW.write(str(Extractor(extractor='ArticleExtractor', url=urls).getText() + "\n\n\n" + str(articleDate.publish_date)+"\n\n\n"))
      File "/Users/Adrian/anaconda3/lib/python3.6/site-packages/boilerpipe/extract/__init__.py", line 47, in __init__
        self.data = self.data.decode(encoding)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
The error message is pointing to a line in boilerpipe/extract/__init__.py, which makes a call to the unicode built-in function.
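For illustration, here is a minimal sketch of what turning raw bytes into text looks like in Python 3, where the name unicode no longer exists and bytes.decode() returns a str. The traceback suggests the package already falls back to self.data.decode(encoding) after the NameError, and that this fallback is what finally fails. Byte 0x8b in position 1 matches the second byte of the gzip magic number (0x1f 0x8b), so one possibility, and this is an assumption rather than something the traceback proves, is that the downloaded body was gzip-compressed. The helper name to_text is hypothetical and is not part of boilerpipe.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def to_text(raw, encoding="utf-8"):
    """Hypothetical helper: decode a raw response body to str on Python 3.

    Assumes the body may be gzip-compressed; if it starts with the gzip
    magic number it is decompressed before being decoded.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    # Python 3 equivalent of Python 2's unicode(raw, encoding)
    return raw.decode(encoding)


# Plain bytes decode directly; compressed bytes are decompressed first.
print(to_text(b"<p>plain</p>"))
print(to_text(gzip.compress("<p>compressed</p>".encode("utf-8"))))
```

If this turns out to be the cause, decoding the page yourself and handing the resulting text to the extractor, rather than letting it fetch the URL, may sidestep the failing line.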
I assume the link below is the source code for the package you are using. If so, it appears to be written for Python 2.7, which you can see if you look near the end of this file:
https://github.com/misja/python-boilerpipe/blob/master/setup.py
You have several options as far as I can see:
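One is to edit the package's source so that it runs on Python 3 (for example, replacing unicode with str, but later changes could cause problems with other parts of the package). This official tool should be of assistance; this official guide should, as well.
I hope this helps!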