Tags: python, exception, parsing, beautifulsoup, malformed

Why is BeautifulSoup throwing this HTMLParseError?


I thought BeautifulSoup would be able to handle malformed documents, but when I passed it the source of a page, it printed the following traceback:


Traceback (most recent call last):
  File "mx.py", line 7, in <module>
    s = BeautifulSoup(content)
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1499, in __init__
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1230, in __init__
  File "build\bdist.win32\egg\BeautifulSoup.py", line 1263, in _feed
  File "C:\Python26\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python26\lib\HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "C:\Python26\lib\HTMLParser.py", line 314, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "C:\Python26\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: u"", at line 258, column 34

Shouldn't it be able to handle this sort of input? If it can, how do I make it do so? If not, is there another module that can handle malformed documents?

EDIT: Here's an update. I saved the page locally using Firefox and tried to create a soup object from the contents of the file. That's where BeautifulSoup fails. If I create a soup object directly from the website, it works. Here's the document that causes trouble for soup.


Solution

  • It worked fine for me using BeautifulSoup version 3.0.7. The latest is 3.1.0, but there's a note on the BeautifulSoup home page to try 3.0.7a if you're having trouble. I think I ran into a problem similar to yours some time ago and reverted to the older version, which fixed it; I'd try that.

    If you want to stick with your current version, I suggest removing the large <script> block at the top before parsing, since that is where the error occurs and you cannot parse that section with BeautifulSoup anyway.
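
One way to remove the <script> blocks before handing the page to BeautifulSoup is a small regular-expression pass over the raw source. This is only a minimal sketch: the names `strip_scripts` and `SCRIPT_BLOCK` and the sample markup are made up for illustration, and it assumes each script block ends at the first `</script>` that follows it.

```python
import re

# Hypothetical helper, not part of BeautifulSoup.
# Matches a whole <script ...>...</script> block, case-insensitively and
# across line breaks. The non-greedy ".*?" stops at the first closing tag,
# so each block is matched separately.
SCRIPT_BLOCK = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html):
    """Return html with every <script>...</script> block removed."""
    return SCRIPT_BLOCK.sub("", html)


# Invented sample: JavaScript that builds a tag out of string pieces,
# the kind of content an HTML parser can mistake for a bad end tag.
sample = (
    "<html><head><script type='text/javascript'>\n"
    "document.write('<b>x</' + 'b>');\n"
    "</script></head><body><p>Hello</p></body></html>"
)

cleaned = strip_scripts(sample)
print(cleaned)
```

The cleaned string can then be passed to the parser as before, for example `BeautifulSoup(strip_scripts(content))`. A regular expression is not a general HTML parser, so this is only meant as a pre-filter for the one block that trips up the parser.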