I'm looking for the best way to convert HTML to text, using only modules from the Python 2.7.x standard library. (I.e., no BeautifulSoup, etc.)
By HTML-to-text conversion I mean the moral equivalent of lynx -dump. In fact, just getting rid of HTML tags intelligently, and converting all HTML entities to ASCII (or to UTF-8-encoded Unicode), would suffice.
No regex-based answers, please. (Regexes are not up to the task.)
Thanks!
Python has shipped the HTMLParser module since 2.2. It's neither the most efficient nor the easiest to use, but it's there...
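As a rough illustration of the HTMLParser route, here is a minimal sketch, not a lynx -dump replacement. The names TextExtractor and html_to_text, and the choice of which tags start a new line, are my own assumptions. The imports fall back to the Python 3 module names so the same code runs on both.

```python
try:  # Python 2.7
    from HTMLParser import HTMLParser
    from htmlentitydefs import name2codepoint
    to_char = unichr
except ImportError:  # Python 3
    from html.parser import HTMLParser
    from html.entities import name2codepoint
    to_char = chr


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text, skipping <script>/<style> bodies."""

    SKIP = ("script", "style")
    # Assumed set of tags that should start a new line in the output.
    BLOCK = ("p", "br", "div", "li", "tr", "h1", "h2", "h3")

    def __init__(self):
        HTMLParser.__init__(self)  # old-style class on 2.7, so no super()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(u"\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        # Called for named entities such as &amp; on Python 2; recent
        # Python 3 versions convert them before handle_data instead.
        if not self.skip_depth:
            codepoint = name2codepoint.get(name)
            # Unknown entities are passed through unchanged.
            self.parts.append(to_char(codepoint) if codepoint else u"&%s;" % name)

    def handle_charref(self, name):
        # Called for numeric references such as &#169; or &#xA9; on Python 2.
        if not self.skip_depth:
            if name.lower().startswith("x"):
                codepoint = int(name[1:], 16)
            else:
                codepoint = int(name)
            self.parts.append(to_char(codepoint))


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return u"".join(parser.parts)
```

Calling html_to_text(page_source) returns a Unicode string; call .encode("utf-8") on the result if you need UTF-8 bytes. Whitespace handling is deliberately crude here, so expect to post-process blank lines.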
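And if you're dealing with proper XHTML (or you can pass it through Tidy), you can use the much better ElementTree:
from xml.etree import ElementTree
# parse() returns an ElementTree; tostring() is a module-level function, not a method
tree = ElementTree.parse("your_document.xhtml")
# method="text" (new in Python 2.7) keeps only the text content
your_string = ElementTree.tostring(tree.getroot(), encoding="utf-8", method="text")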