python html-parsing standard-library html-to-text

html-to-text conversion using Python standard library only

I'm looking for the best way to convert HTML to text, using only modules from the Python 2.7.x standard library. (I.e., no BeautifulSoup, etc.)

By HTML-to-text conversion I mean the moral equivalent of lynx -dump. In fact, just getting rid of HTML tags intelligently, and converting all HTML-entities to ASCII (or to UTF8-encoded unicode), would suffice.

No regex-based answers, please. (Regexes are not up to the task.)

Thanks!

Solution

Python has shipped the HTMLParser module since version 2.2. It is neither the most efficient option nor the easiest to use, but it is always there.
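As a rough sketch of how that could look (not a full lynx -dump replacement): subclass HTMLParser, collect the text nodes, and decode entities yourself. The names TextExtractor and html_to_text, the list of block-level tags, and the choice to drop script/style bodies are my own assumptions for illustration. The try/except imports are only there so the same snippet also runs on Python 3.

```python
try:  # Python 2.7
    from HTMLParser import HTMLParser
    from htmlentitydefs import name2codepoint
except ImportError:  # Python 3
    from html.parser import HTMLParser
    from html.entities import name2codepoint

try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text, skipping <script>/<style> bodies."""

    SKIPPED = ("script", "style")
    # Assumed set of tags that should start a new line in the output.
    BLOCK = ("p", "div", "br", "li", "tr", "h1", "h2", "h3")

    def __init__(self):
        # HTMLParser is an old-style class on 2.7, so super() is not usable.
        HTMLParser.__init__(self)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.chunks.append(u"\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)

    def handle_entityref(self, name):
        # Called for named entities such as &amp; on 2.7. Recent Python 3
        # versions decode them before handle_data, so this is never reached.
        if self.skip_depth:
            return
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            self.chunks.append(u"&%s;" % name)  # unknown entity: keep as-is
        else:
            self.chunks.append(unichr(codepoint))

    def handle_charref(self, name):
        # Called for numeric references such as &#233; or &#xE9; on 2.7.
        if self.skip_depth:
            return
        try:
            if name[0] in "xX":
                codepoint = int(name[1:], 16)
            else:
                codepoint = int(name)
            self.chunks.append(unichr(codepoint))
        except (ValueError, OverflowError):
            self.chunks.append(u"&#%s;" % name)  # malformed: keep as-is


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return u"".join(parser.chunks)
```

On 2.7 you would want to pass a unicode string (decode the bytes first), then encode the result to UTF-8 if that is what you need. Whitespace handling is deliberately naive here; collapsing runs of spaces and blank lines is left out.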

And if you're dealing with well-formed XHTML (or you can pass the input through Tidy first), you can use the much nicer ElementTree:

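from xml.etree import ElementTree

# parse() returns an ElementTree instance. tostring() is a module-level
# function that takes an Element, not a method of the tree object.
# method="text" requires Python 2.7 or later.
tree = ElementTree.parse("your_document.xhtml")
your_string = ElementTree.tostring(tree.getroot(), method="text", encoding="utf-8")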
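
If the markup is already in a string rather than a file, something along these lines should work too. This is only a sketch: xhtml_to_text and the sample document are made up for illustration, and it assumes the input is well-formed XML. As far as I know, HTML-only named entities such as &nbsp; are not predefined for the XML parser, so input containing them will likely fail to parse unless it goes through Tidy first.

```python
from xml.etree import ElementTree


def xhtml_to_text(xhtml):
    # fromstring() returns the root Element; itertext() walks every text
    # node in document order (available since Python 2.7).
    root = ElementTree.fromstring(xhtml)
    return u"".join(root.itertext())


sample = "<html><body><p>Fish &amp; chips</p><p>Tea</p></body></html>"
print(xhtml_to_text(sample))
```

Note that the text nodes are simply concatenated, so adjacent paragraphs run together unless the source already has whitespace between them.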