Python 3 Unicode not found


I'm aware that unicode was changed to str in Python 3, but I keep getting the same error no matter how I write this code. Can anyone tell me why?

I'm using boilerpipe for a specific set of webcrawls:

for urls in allUrls:
    fileW = open('article('+ str(counter)+')', 'w')
    articleDate = Article(urls)
    articleDate.download()
    articleDate.parse()
    print(articleDate.publish_date)
    fileW.write(str(Extractor(extractor='ArticleExtractor', url=urls).getText() + "\n\n\n" + str(articleDate.publish_date)+"\n\n\n"))
    fileW.close()  # call the method; a bare fileW.close never closes the file
    counter += 1

Error:

Traceback (most recent call last):
  File "/Users/Adrian/anaconda3/lib/python3.6/site-packages/boilerpipe/extract/__init__.py", line 45, in __init__
    self.data = unicode(self.data, encoding)
NameError: name 'unicode' is not defined

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "webcrawl.py", line 26, in <module>
    fileW.write(str(Extractor(extractor='ArticleExtractor', url=urls).getText() + "\n\n\n" + str(articleDate.publish_date)+"\n\n\n"))
  File "/Users/Adrian/anaconda3/lib/python3.6/site-packages/boilerpipe/extract/__init__.py", line 47, in __init__
    self.data = self.data.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Solution

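  • The first traceback points to a line in boilerpipe/extract/__init__.py that calls the unicode built-in function, which does not exist in Python 3. The second traceback shows the package falling back to self.data.decode(encoding), and that fallback is what finally fails with the UnicodeDecodeError.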

    I assume the link below is the source code for the package you are using. If so, it appears to be written for Python 2.7, which you can see if you look near the end of this file:

    https://github.com/misja/python-boilerpipe/blob/master/setup.py

    You have several options as far as I can see:

    1. Find a Python 3 port of this package. There are at least a few out there (here's one and here's another).
    2. Port the package to Python 3 yourself. If that is the only error, you could simply change that line to use str, but other parts of the package may need similar changes. This official tool should be of assistance; this official guide should, as well.
    3. Port your project to Python 2.7 and continue using the same package.
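
For option 2, here is a minimal sketch of what replacing the failing line might look like. It assumes self.data holds the raw bytes of the HTTP response; to_text is a hypothetical helper, not part of boilerpipe. The byte 0x8b at position 1 in your traceback matches the second byte of the gzip magic number (0x1f 0x8b), so the sketch also guesses that the response may have been gzip-compressed and decompresses it before decoding. That is an assumption I have not verified against your URLs.

```python
import gzip


def to_text(data, encoding="utf-8"):
    """Hypothetical helper (not boilerpipe API): return str from bytes or str."""
    # Python 3 str is already Unicode, so there is nothing to decode.
    if isinstance(data, str):
        return data
    # gzip streams start with the two magic bytes 0x1f 0x8b.
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    return data.decode(encoding)
```

Inside the package, the line self.data = unicode(self.data, encoding) could then become something like self.data = to_text(self.data, encoding), again assuming self.data is bytes at that point.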

    I hope this helps!