Tags: python, html, windows, beautifulsoup, chardet

Error while parsing a page with BeautifulSoup4, Chardet and Python 3.3 in Windows


I get the following error when I call BeautifulSoup(page):

Traceback (most recent call last):
  File "error.py", line 10, in <module>
    soup = BeautifulSoup(page)
  File "C:\Python33\lib\site-packages\bs4\__init__.py", line 169, in __init__
    self.builder.prepare_markup(markup, from_encoding))
  File "C:\Python33\lib\site-packages\bs4\builder\_htmlparser.py", line 136, in prepare_markup
    dammit = UnicodeDammit(markup, try_encodings, is_html=True)
  File "C:\Python33\lib\site-packages\bs4\dammit.py", line 223, in __init__
    u = self._convert_from(chardet_dammit(self.markup))
  File "C:\Python33\lib\site-packages\bs4\dammit.py", line 30, in chardet_dammit
    return chardet.detect(s)['encoding']
  File "C:\Python33\lib\site-packages\chardet\__init__.py", line 21, in detect
    import universaldetector
ImportError: No module named 'universaldetector'

I am running Python 3.3 on Windows 7. I installed bs4 by downloading the .tar.gz and running its setup.py. I then installed pip and used it to install chardet (pip.exe install chardet). My chardet version is 2.2.1. bs4 works fine for other URLs.

Here's the code:

import sys
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import chardet

url = "http://www.edgar-online.com/brand/yahoo/search/?cik=1400810"
page = urlopen(url).read()
#print(page)
soup = BeautifulSoup(page)

I look forward to your answers.


Solution

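  • I just ran into the same problem.
    Removing the import chardet line and uninstalling chardet fixed it for me; the script then ran without errors.
    You probably have a chardet build that is not compatible with Python 3.3. The traceback shows chardet executing "import universaldetector", an implicit relative import that only works on Python 2.
    The code below is the relevant part of dammit.py in BeautifulSoup. With the broken chardet installed, "import chardet" itself succeeds, so the ImportError fallback is never used and the failure only surfaces later, when chardet.detect() is called. Once chardet is uninstalled (and assuming cchardet is not installed either), chardet_dammit() simply returns None and BeautifulSoup falls back to its other encoding guesses.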

    try:
        # First try the fast C implementation.
        #  PyPI package: cchardet
        import cchardet
        def chardet_dammit(s):
            return cchardet.detect(s)['encoding']
    except ImportError:
        try:
            # Fall back to the pure Python implementation
            #  Debian package: python-chardet
            #  PyPI package: chardet
            import chardet
            def chardet_dammit(s):
                return chardet.detect(s)['encoding']
            #import chardet.constants
            #chardet.constants._debug = 1
        except ImportError:
            # No chardet available.
            def chardet_dammit(s):
                return None
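
If uninstalling chardet is not an option, another possible workaround is to decode the page yourself and pass a str to BeautifulSoup, since encoding detection only applies to bytes input. The following is a minimal sketch, not part of bs4: the helper names safe_detect and decode_page, and the "utf-8" fallback, are assumptions made for illustration.

```python
def safe_detect(data):
    """Return chardet's encoding guess for data, or None if chardet is missing or broken."""
    try:
        import chardet
        # A Python 2-only chardet imports fine but raises ImportError here,
        # inside detect(), which is the failure shown in the traceback.
        return chardet.detect(data)["encoding"]
    except ImportError:
        return None


def decode_page(raw, fallback="utf-8"):
    """Decode raw bytes to str so the parser does not need to guess an encoding."""
    encoding = safe_detect(raw) or fallback
    try:
        return raw.decode(encoding, errors="replace")
    except LookupError:
        # chardet named an encoding that Python does not know.
        return raw.decode(fallback, errors="replace")
```

With these helpers, the last line of the question's script would become soup = BeautifulSoup(decode_page(page)). Unlike the guarded import in dammit.py, the try block here also covers the call to detect(), so a chardet that fails only at call time is treated as unavailable.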